-
Notifications
You must be signed in to change notification settings - Fork 626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Per-app usage metrics #3429
Conversation
pkg/distributor/distributor.go
Outdated
@@ -746,7 +754,7 @@ func (g *groupsWithFingerprints) add(stringTable []string, lbls phlaremodel.Labe | |||
}) | |||
} | |||
|
|||
func extractSampleSeries(req *distributormodel.PushRequest, relabelRules []*relabel.Config) (result []*distributormodel.ProfileSeries, bytesRelabelDropped, profilesRelabelDropped float64) { | |||
func extractSampleSeries(req *distributormodel.PushRequest, usageGroups *validation.TenantUsageGroups, relabelRules []*relabel.Config) (result []*distributormodel.ProfileSeries, bytesRelabelDropped, profilesRelabelDropped float64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unfortunate this function signature had to change. This function previously returned all dropped bytes, so I couldn't use the return value as there may be series present that got dropped which need to specially labeled.
pkg/distributor/distributor.go
Outdated
// note(bryanhuhta): We don't need to label this metric with service | ||
// name as rate limited requests don't count towards billable bytes. | ||
validation.DiscardedBytes.WithLabelValues(string(validation.RateLimited), tenantID).Add(float64(req.TotalBytesUncompressed)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a valid assumption. When calculating billable bytes, the recording rules seem to subtract out any bytes that were rate limited.
pkg/distributor/distributor_test.go
Outdated
ug := &validation.TenantUsageGroups{ | ||
TenantID: "", | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only here because the extractSampleSeries
signature changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation of the service name allowlist.
Looks good. Few random thoughts:
I don't feel very strongly about 1 and 2, I do feel strongly on the third point. |
Adopt more flexible config structure
After @petethepig's comments, I made the following changes:
Here's it working in fire-dev-001: https://ops.grafana-ops.net/goto/yu16UVXIg?orgId=1 Using the following usage group definitions: distributor_usage_groups:
- pyroscope: '{service_name=~"fire-dev-001/.*"}'
- cortex-dev-01/ingester: '{service_name="cortex-dev-01/ingester"}' This groups all the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have a few concerns which I outlined in the line by line comments. Happy to dive into them a bit more if stuff is unclear.
It is also worth taking a look at how Mimir does it, because their implementaiton is obviously a bit more tested: https://github.com/grafana/mimir/blob/main/pkg/ingester/activeseries/custom_trackers_config.go
pkg/validation/usage_groups.go
Outdated
return nil, fmt.Errorf("no matchers for usage group %q and tenant %q", name, tenantID) | ||
} | ||
|
||
amMatchers, err := amlabels.ParseMatchers(matchersString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I am missing something, but we probably should parse those the same way we do in the query path: https://github.com/simonswine/pyroscope/blob/ed0b5643f48792ac988d975e17e285b807d5c1f9/pkg/phlaredb/head.go#L440
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, a lot of this file was copied from how Mimir implemented this. So no, you aren't missing anything, I just happened to use Mimir's parse metrics approach instead of our own. I'll switch this.
pkg/validation/usage_groups.go
Outdated
|
||
// DistributorUsageGroups returns the usage groups that are enabled for this | ||
// tenant. | ||
func (o *Overrides) DistributorUsageGroups(tenantID string) (*UsageGroupConfig, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any returned error in this method will lead to profiling traffic being dropped. This is not ideal.
This seems quite a severe reaction to "I have wrong accounting usage groups".
I do suggest we should move all those errors into the parsing part of the overrides.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separately this method does a lot and is in the hot path of ingestion. We should also move as much as we can into the "loading the overrides" part of the code.
E.g. the parsing of label matchers should only happen only when the override is parsed and not for every profile ingested.
pkg/validation/limits.go
Outdated
@@ -50,6 +50,9 @@ type Limits struct { | |||
MaxProfileStacktraceDepth int `yaml:"max_profile_stacktrace_depth" json:"max_profile_stacktrace_depth"` | |||
MaxProfileSymbolValueLength int `yaml:"max_profile_symbol_value_length" json:"max_profile_symbol_value_length"` | |||
|
|||
// Distributor per-app usage breakdown. | |||
DistributorUsageGroups []map[string]string `yaml:"distributor_usage_groups" json:"distributor_usage_groups"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind explaining why this is a slice of maps, rather than just a map, which I would expect?
Here the yaml "explanation of my question":
current_usage_groups:
- my_team: '{namespace="cool-stuff"'
- other_team: '{namespace="boring-stuff"}'
why_not_usage_groups:
my_team: '{namespace="cool-stuff"}'
other_team: '{namespace="boring-stuff"}'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted the labels to be applied deterministically (the last usage group matched will be used). I originally used the why_not_usage_groups
approach, but if two usages groups matched a label set, we wouldn't know which one would get chosen.
distributor_usage_groups:
specific_cool_stuff: '{namespace="cool-stuff", service_name="thing"}'
general_cool_stuff: '{namespace="cool-stuff"}'
Depending on how each key got hashed into the map, we'd get one or the other of the two usage groups. We could simplify the config and matching logic if we relaxed the determinism constraint, but I worry this might skew results oddly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally get the order and deterministic result problem (now that you mentioned it😆 ), but I think the map inside the slice will still be non-deterministic in the current implementation.
I think maybe a slice of two string might be a better solution:
name string
selector string
}
or just sorting by the map key (name of the rule group).
Probably it is best to aim for consistency with other Grafana Products and do what Mimir does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation has had a few iterations and has changed significantly since I opened this PR. This is a recap of the implementation to-date.
Design
The idea of the usage group at this point is identical (or nearly so) to Mimir's approach. A usage group has a name and matcher(s). When a profile is ingested, it's labels are parsed and matched against each usage group. Any usage group it matches will be collected and the profile size will be counted by the pyroscope_usage_group_received_decompressed_total
metric. Later, if that profile is dropped for any reason, it the drop will be counted by the pyroscope_usage_group_discarded_bytes_total
metric.
As noted before, a profile can match 0 or more usage groups.
- If 0, then it's bucketed into a default usage group called "other"
- If 1 or more, the
pyroscope_usage_group_received_decompressed_total
will be counted with each matching usage group
As an example, consider the following labels attached to a profile:
[
{ "service_name": "foo" }
{ "namespace": "barbaz" }
]
along with the following usage groups:
distributor_usage_groups:
app/foo: '{service_name="foo"}'
namespace/bar: '{namespace=~"bar.*"}'
the pyroscope_usage_group_received_decompressed_total
metric will be counted once for app/foo
and once for namespace/bar
:
pyroscope_usage_group_received_decompressed_total{usage_group="app/foo"}
pyroscope_usage_group_received_decompressed_total{usage_group="namespace/bar"}
Implementation
I created a struct UsageGroupConfig
which is responsible for
- unmarshalling the usage group config (from both yaml and json)
- validating the group count (limit 50 for now)
- normalizing usage group names
- parsing and validating the matchers
This struct is built once at start up and any errors it generates will cause the app to fail to run. I believe this is reasonable.
UsageGroupConfig
has a method GetUsageGroups(phlaremodel.Labels)
which accepts a label set. It will match the label set against all the usage groups and return a list of all that match via the proxy object UsageGroupMatch
. UsageGroupMatch
has methods that can report a number as "bytes received" or "bytes dropped". This is where the logic of emitting a metric label for each usage group occurs.
Remarks
@simonswine identified one of the former approaches as being detrimental to the hot path both in aggressive failure cases and lots of new computation. Both these concerns are addressed by doing the bulk of the error-prone parsing and validation at startup, leaving simple accounting to the hot path.
Additionally, there was an ask to make this look and feel more like Mimir's approach. Whilst it isn't exactly the same, I'm hoping with the most recent changes we move much, much closer to their implementation. I took a lot of inspiration from how they implemented this feature, so the user-facing look-and-feel and the code of this feature should feel very similar to Mimir's.
Lastly, It's important to note that this approach isn't designed to be accurate down to the byte and penny. @petethepig and I agreed to take this approach because it should be approximately accurate and provide proportionally accurate results. So even if this feature doesn't report penny-exact numbers, the error should be proportionally the same across all usage groups. Big apps will produce big numbers, small apps will produce small numbers.
pkg/validation/usage_groups.go
Outdated
// TODO(bryanhuhta): We should probably validate the usage group name | ||
// is a valid label value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't figure out where prometheus does its validation of label values, but wherever it is, we should copy it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is any valid utf8 string: https://github.com/prometheus/client_golang/blob/9f203a098ec630d179ccef4efbaa8c3341291d00/prometheus/labels.go#L158
I suggest we also disallow "other" to avoid undefined behaviour.
// It should never be nil, but check just in case! | ||
if config == nil { | ||
config = &UsageGroupConfig{} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the current code, the config will never be nil. However, I think we should leave this check here so we treat a nil config like a empty config, in case the config could be nil at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this and also for providing so much detailed comments and ensure this is properly tested. ❤️
I would like you to take a look at my two suggestions regarding validation and tenant_id
flow. But I think this is already for a LGTM.
pkg/validation/usage_groups.go
Outdated
// TODO(bryanhuhta): We should probably validate the usage group name | ||
// is a valid label value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is any valid utf8 string: https://github.com/prometheus/client_golang/blob/9f203a098ec630d179ccef4efbaa8c3341291d00/prometheus/labels.go#L158
I suggest we also disallow "other" to avoid undefined behaviour.
pkg/validation/usage_groups.go
Outdated
if config.TenantID == "" { | ||
config.TenantID = tenantID | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a bit sketchy to me, we despite not registering it as through "RegisterFlag" in theory you could still set in the pyroscope.yaml config a default usage groups for all tenants.
This bit would then expose the same *UsageGroupConfig to all tenant that don't override it, with a race to that bit which sets the tenant.
I think it would be better to do the tenant field in a wrapped type or even simpler put it to a parameter to the Count*()
methods
type UsageGroupConfig // without TenantID
type UsageGroupInstance{
*UsageGroupConfig
tenantID string
}
related https://github.com/grafana/pyroscope-squad/issues/162
Warning
This PR body is outdated, but I want to leave it here for posterity's sake. Please refer to this comment to get the most up-to-date and accurate summary of the changes in the PR.
This implements per-app usage for bytes ingested/dropped (both metrics required to calculate billable bytes). I took the following metrics
distributor_received_decompressed_bytes
discarded_bytes_total
and added the
service_name
label to them. This label value is typically blank, but when the tenant overrides have the following property setany series which has a
service_name
label that matches one ofdistributor_usage_groups
will haveservice_name
set on the ingest/dropped metrics. This should allow us to create recording rules to send to the billing cortex and let customers view per-app usage breakdowns.Note that there is a cost to adding an app to the allowlist. We should be prudent when adding services, especially for large customers.