
Derive metrics from ingested profiles #2908

Open · kolesnikovae opened this issue Jan 10, 2024 · 0 comments
Labels: enhancement (New feature or request)

kolesnikovae (Collaborator) commented Jan 10, 2024

Background

Pyroscope offers users a query interface for basic filtering and aggregation, enabling profiling data analysis. This includes the ability to query a profile and its associated time-series for a specific execution context, such as an HTTP endpoint:

(screenshot: profile and its associated time series queried for a specific HTTP endpoint)

Profiling data can serve as a source for deriving metrics (time series) in various scenarios:

  • Measuring resources consumed by specific job classes or execution scopes in the application.
  • Implementing scheduling, pacing, throttling, and rate limiting based on resource consumption for account management, billing, and accounting.
  • Managing resource quotas.
  • Continuous benchmarking.

Another area of interest is alerting, behavior analysis, and anomaly detection.

Runtime environments rarely allow collecting such statistics in a convenient way; profiling data augmented with dynamic tags (sample labels) can fill this gap.

Problem

Pyroscope faces limitations that hinder its use in the scenarios mentioned above:

  • Sensitivity to the curse of dimensionality in sampled profiling data, particularly with dynamic labels of high cardinality.
  • Limited data manipulation capabilities in the Pyroscope query engine compared to traditional time-series databases.
  • Specificity of the Pyroscope query interface, making it incompatible with existing query languages and dialects (except Prometheus label matchers).

These limitations will be mitigated in the future but won't be completely eliminated.

Proposal

Code Instrumentation

I propose an SDK that enables users to define metrics directly in the code, similar to how regular metrics are defined. For example, in Go, a Prometheus-like interface could be implemented as follows:

// The timer/watcher is to be initialized similarly to any other prometheus
// metrics. The difference is that it does not need/support registration,
// and is never exported at the source (although it can be implemented).
var jobCPUTimer = cputimer.NewCPUTimerVec(cputimer.Opts{
	Name: "my_job_cpu_time_total",
}, []string{"class", "account"})

var userCPUTimer = jobCPUTimer.WithLabelValues("ETL", "user-1")

func doJob(ctx context.Context) {
	ctx, stop := userCPUTimer.Start(ctx)
	defer stop()
	// CPU-intensive job
}

Through pprof labels, samples of the timer scope are attributed to specific time series, such as my_job_cpu_time_total{class="ETL",account="user-1"}. Static profile labels (e.g., pod, service_name, region) also become part of the metric labels.
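To make the attribution mechanism concrete, here is a minimal sketch of how such a timer could be implemented on top of runtime/pprof labels. The cputimer package, its types, and the __metric_name__ label key are assumptions derived from the example above, not an existing API:

// Sketch: a CPU timer backed by pprof labels. The "__metric_name__" key is
// an illustrative convention; only runtime/pprof is a real dependency.
package cputimer

import (
	"context"
	"runtime/pprof"
)

type Opts struct {
	Name string
}

type CPUTimerVec struct {
	opts       Opts
	labelNames []string
}

type CPUTimer struct {
	vec         *CPUTimerVec
	labelValues []string
}

func NewCPUTimerVec(opts Opts, labelNames []string) *CPUTimerVec {
	return &CPUTimerVec{opts: opts, labelNames: labelNames}
}

func (v *CPUTimerVec) WithLabelValues(values ...string) *CPUTimer {
	return &CPUTimer{vec: v, labelValues: values}
}

// Start attaches the metric name and label values to the calling goroutine
// as pprof labels; CPU samples collected while the scope is active can then
// be attributed to the corresponding time series on the server side.
func (t *CPUTimer) Start(ctx context.Context) (context.Context, func()) {
	kv := []string{"__metric_name__", t.vec.opts.Name}
	for i, name := range t.vec.labelNames {
		kv = append(kv, name, t.labelValues[i])
	}
	prev := ctx
	ctx = pprof.WithLabels(ctx, pprof.Labels(kv...))
	pprof.SetGoroutineLabels(ctx)
	return ctx, func() {
		// Restore the labels carried by the previous context when the scope ends.
		pprof.SetGoroutineLabels(prev)
	}
}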

Recording rules

Additionally, users should be able to configure metric recording rules outside the code when instrumentation is not available or dynamic labels are not supported for a given profile type. These rules allow scoping metrics to a stack trace location or a function name, albeit with increased configuration complexity due to the volatile nature of stack traces.
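For illustration, a recording rule might carry fields along these lines; the schema below is a sketch under assumed names, not an existing Pyroscope configuration:

// Sketch of a server-side recording rule. All field names are illustrative
// assumptions, not an existing Pyroscope configuration schema.
type RecordingRule struct {
	// Name of the derived metric, e.g. "etl_job_cpu_time_total".
	MetricName string
	// Profile type the rule applies to, e.g. "process_cpu:cpu:nanoseconds".
	ProfileType string
	// Label matchers selecting the source profiles (service_name, namespace, ...).
	Matchers []string
	// Optional function name (or stack trace location) that scopes the metric
	// to samples whose stacks contain this frame; this is the part that is
	// fragile under refactoring, hence the extra configuration cost.
	FunctionName string
	// Profile/sample labels to carry over into the resulting series.
	GroupBy []string
}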

Metrics export

On the server side, distributors extract configured and annotated time-series from sample values and proxy them to a time-series database in commonly used formats. Metric labels are removed from profiles to avoid issues associated with high cardinality of sample dimensions.

Prometheus remote write could be the default protocol, but it requires strict ordering of samples by timestamp within a single series (so does Mimir, although it can handle out-of-order writes). This may require us to introduce a dedicated component/service responsible for ordering, batching, sharding, and sending. A good example is Tempo's metrics-generator.
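As a rough sketch of the aggregation such a component would perform, the snippet below sums labeled sample values per series and emits one data point per series at each flush, keeping samples within each series ordered by timestamp. The types and the flush mechanism are assumptions for illustration, not an existing Pyroscope component:

// Sketch: accumulate labeled sample values per series and emit one point per
// series at each flush. Batching, sharding, and the remote-write client are
// out of scope; all types here are illustrative.
package export

import (
	"sort"
	"sync"
	"time"
)

// LabeledSample is a profile sample value together with the metric labels
// derived from its pprof labels and static profile labels.
type LabeledSample struct {
	SeriesKey string // canonical encoding of the label set
	Value     int64  // e.g. CPU time in nanoseconds
}

type Aggregator struct {
	mu   sync.Mutex
	sums map[string]int64
}

func NewAggregator() *Aggregator {
	return &Aggregator{sums: make(map[string]int64)}
}

func (a *Aggregator) Observe(s LabeledSample) {
	a.mu.Lock()
	a.sums[s.SeriesKey] += s.Value
	a.mu.Unlock()
}

type Point struct {
	Series      string
	TimestampMs int64
	Value       float64
}

// Flush drains the accumulated sums into data points stamped with the flush
// time. Calling Flush at monotonically increasing times keeps samples within
// each series ordered by timestamp, as remote write requires.
func (a *Aggregator) Flush(now time.Time) []Point {
	a.mu.Lock()
	defer a.mu.Unlock()
	points := make([]Point, 0, len(a.sums))
	for key, v := range a.sums {
		points = append(points, Point{Series: key, TimestampMs: now.UnixMilli(), Value: float64(v)})
		delete(a.sums, key)
	}
	sort.Slice(points, func(i, j int) bool { return points[i].Series < points[j].Series })
	return points
}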

A very basic prototype (user code):

(screenshot: prototype user code)

Related:
