Stabilize and extend Parca retention #630

bwplotka · 2023-12-12T10:24:44Z

I don't think we set any retention, but maybe it OOMs at some point? We see inconsistent Parca retention; let's improve it.

E.g. running prombench for 4 days, yet only ~1h of data.

cc @bboreham @kakkoyun

bwplotka · 2025-01-31T09:43:30Z

It's also now just down. We need to stabilize this, e.g. use object storage or similar to avoid OOMs and keep profiles for longer.

bwplotka · 2025-02-18T08:44:22Z

For context, for those who would like to help:

Use cases for cont. profiling

We need continuous profiling mostly to store profiles retrospectively and to access them easily. Without cont. profiling we have to:

1. Go to (public) http://prombench.prometheus.io//prometheus-pr/debug/pprof (or to a direct profile URL, e.g. heap) to download the profile.
2. Explore it locally or on pprof.me.
3. Upload it to pprof.me or GitHub to share with others.
4-6: Same for http://prombench.prometheus.io//prometheus-release/debug/pprof
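
The manual steps above could be scripted. This is a minimal sketch, not anything that exists in test-infra today. It assumes the `/debug/pprof` index URL is copied verbatim from the benchmark and passed in untouched, and that the targets expose the standard Go `net/http/pprof` endpoints (`heap`, `profile?seconds=N`). The helper names `profile_url` and `download_profile` are hypothetical.

```python
import urllib.request
from urllib.parse import urlencode


def profile_url(pprof_index_url, profile, seconds=None):
    """Build the URL of a single profile under a /debug/pprof index URL."""
    url = pprof_index_url.rstrip("/") + "/" + profile
    if seconds is not None:
        # CPU profiles ("profile") accept a duration in seconds.
        url += "?" + urlencode({"seconds": seconds})
    return url


def download_profile(pprof_index_url, profile, dest, seconds=None, timeout=120):
    """Download one profile to a local file and return the file path."""
    url = profile_url(pprof_index_url, profile, seconds)
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest, "wb") as out:
        out.write(resp.read())
    return dest
```

For example, `download_profile(index_url, "heap", "heap.pb.gz")` followed by `go tool pprof heap.pb.gz` covers steps 1 and 2; sharing the file is still manual.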

This is not super bad, but:

You need to remember the URL (we could add links to the description).
This obviously is not accessible once you have turned off the benchmark. Just yesterday I stopped a benchmark and noticed I had forgotten the CPU profile...
It takes a few manual steps.
If some event occurs, you can't take a profile at exactly that moment. In particular, it would be ideal to have profiles from both Prometheus binaries taken at the same time for clear comparisons.
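
To illustrate the last point, here is a rough sketch of grabbing the same profile from both binaries at (roughly) the same moment. It is only an illustration under assumptions: `capture_pair` and `http_get` are hypothetical names, the labels `"pr"` and `"release"` are arbitrary, and the caller supplies the full profile URLs.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_get(url, timeout=120):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def capture_pair(urls, fetch=http_get):
    """Fetch one profile per labelled target concurrently.

    `urls` maps a label (e.g. "pr", "release") to a full profile URL.
    Returns (unix_timestamp_when_started, {label: profile_bytes}).
    """
    started = time.time()
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        # Submit all requests first so they run in parallel.
        futures = {label: pool.submit(fetch, url) for label, url in urls.items()}
        # result() re-raises a fetch error, so a failed target is not silently dropped.
        return started, {label: fut.result() for label, fut in futures.items()}
```

The `fetch` parameter exists only so the function can be exercised without network access. A continuous profiler does this on a schedule, which is the part that is hard to replicate by hand.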

Current implementation

We have a Parca running in the cluster. We provide a link to it on each benchmark and it's accessible publicly.

Currently it is down (503 Service Unavailable) because, AFAIK, @bboreham scaled it down: it was blocking something (is it because it is taking a CPU profile, so we cannot take another one manually?) and it was crashlooping a lot (OOM). I changed the replica count to 1 just now for debugging.

TODO

In practice we need a stable solution whose maintenance effort is lower than the fuss of taking profiles manually. We could also scrape profiles at an increased interval for stability. Maybe use GCS for storage? We definitely need some memory stability: it would be better to lose some old data than to crashloop.
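
To make the "lose old data rather than crashloop" trade-off concrete, here is a minimal sketch of a store that keeps only the newest profiles within a fixed byte budget. This is not how Parca manages memory; `BoundedProfileStore` and `max_bytes` are hypothetical names used purely for illustration.

```python
from collections import deque


class BoundedProfileStore:
    """Keeps the newest profiles within a fixed byte budget."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._items = deque()  # (timestamp, profile_bytes), oldest first

    def add(self, timestamp, profile):
        self._items.append((timestamp, profile))
        self.used_bytes += len(profile)
        # Evict the oldest entries until back under budget; always keep the newest one.
        while self.used_bytes > self.max_bytes and len(self._items) > 1:
            _, old = self._items.popleft()
            self.used_bytes -= len(old)

    def timestamps(self):
        """Return the timestamps still retained, oldest first."""
        return [ts for ts, _ in self._items]
```

With such a policy, retention shrinks under memory pressure instead of the process being OOM-killed, so the visible retention window becomes a function of the budget and the scrape interval.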

cc @metalmatze

bwplotka added the bug label Dec 12, 2023

bwplotka mentioned this issue Dec 12, 2023

Current state of the prometheus/test-infra #593

Closed

bwplotka changed the title ~~Extend Parca retention~~ Stabilize and extend Parca retention Jan 31, 2025
