Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor prometheus metrics #335

Merged
merged 12 commits into from
Aug 23, 2021
Merged

Conversation

eh-am
Copy link
Collaborator

@eh-am eh-am commented Aug 12, 2021

This is prework for #332

PR is not ready, it's just to get some feedback early on


Started getting rid of the metrics package, so far only moved out the Counters and started moving out the Gauge stuff

Tests were breaking with variations of

panic recovered: duplicate metrics collector registration attempted; goroutine 23

So ended up taking the registry as a dependency, as per
prometheus/client_golang#716 (comment) 's suggestion
which in consequence required updating the constructor in multiple places

I will double check all this work later, since this is not something very testable (I pondered with the idea of somehow hitting /metrics and parsing the output but meh)

Here's a diff for reference. One clear difference is that lazy instanciated metrics (such as evictions_count are now eagerly initiated with 0)

                                                                <
cache_dicts_hit                                                    cache_dicts_hit
cache_dicts_miss                                                   cache_dicts_miss
cache_dicts_size                                                   cache_dicts_size
cache_dimensions_hit                                               cache_dimensions_hit
cache_dimensions_miss                                              cache_dimensions_miss
cache_dimensions_size                                              cache_dimensions_size
cache_segments_hit                                                 cache_segments_hit
cache_segments_miss                                                cache_segments_miss
cache_segments_size                                                cache_segments_size
cache_trees_hit                                                    cache_trees_hit
cache_trees_miss                                                   cache_trees_miss
cache_trees_size                                                   cache_trees_size
cpu_utilization                                                    cpu_utilization
disk_dicts                                                         disk_dicts
disk_dimensions                                                    disk_dimensions
disk_local_profiles                                                disk_local_profiles
disk_main                                                          disk_main
disk_segments                                                      disk_segments
disk_trees                                                         disk_trees
evictions_alloc_bytes                                              evictions_alloc_bytes
                                                                >  evictions_count
evictions_total_bytes                                              evictions_total_bytes
evictions_used_perc                                                evictions_used_perc
go_gc_duration_seconds_count                                       go_gc_duration_seconds_count
go_gc_duration_seconds{quantile="0"}                               go_gc_duration_seconds{quantile="0"}
go_gc_duration_seconds{quantile="0.25"}                            go_gc_duration_seconds{quantile="0.25"}
go_gc_duration_seconds{quantile="0.5"}                             go_gc_duration_seconds{quantile="0.5"}
go_gc_duration_seconds{quantile="0.75"}                            go_gc_duration_seconds{quantile="0.75"}
go_gc_duration_seconds{quantile="1"}                               go_gc_duration_seconds{quantile="1"}
go_gc_duration_seconds_sum                                         go_gc_duration_seconds_sum
go_goroutines                                                      go_goroutines
go_info{version="go1.15.1"}                                     |  go_info{version="go1.16.6"}
go_memstats_alloc_bytes                                            go_memstats_alloc_bytes
go_memstats_alloc_bytes_total                                      go_memstats_alloc_bytes_total
go_memstats_buck_hash_sys_bytes                                    go_memstats_buck_hash_sys_bytes
go_memstats_frees_total                                            go_memstats_frees_total
go_memstats_gc_cpu_fraction                                        go_memstats_gc_cpu_fraction
go_memstats_gc_sys_bytes                                           go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes                                       go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes                                        go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes                                       go_memstats_heap_inuse_bytes
go_memstats_heap_objects                                           go_memstats_heap_objects
go_memstats_heap_released_bytes                                    go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes                                         go_memstats_heap_sys_bytes
go_memstats_last_gc_time_seconds                                   go_memstats_last_gc_time_seconds
go_memstats_lookups_total                                          go_memstats_lookups_total
go_memstats_mallocs_total                                          go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes                                     go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes                                       go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes                                      go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes                                        go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes                                          go_memstats_next_gc_bytes
go_memstats_other_sys_bytes                                        go_memstats_other_sys_bytes
go_memstats_stack_inuse_bytes                                      go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes                                        go_memstats_stack_sys_bytes
go_memstats_sys_bytes                                              go_memstats_sys_bytes
go_threads                                                         go_threads
process_cpu_seconds_total                                          process_cpu_seconds_total
process_max_fds                                                    process_max_fds
process_open_fds                                                   process_open_fds
process_resident_memory_bytes                                      process_resident_memory_bytes
process_start_time_seconds                                         process_start_time_seconds
process_virtual_memory_bytes                                       process_virtual_memory_bytes
process_virtual_memory_max_bytes                                   process_virtual_memory_max_bytes
promhttp_metric_handler_requests_in_flight                         promhttp_metric_handler_requests_in_flight
promhttp_metric_handler_requests_total{code="200"}                 promhttp_metric_handler_requests_total{code="200"}
promhttp_metric_handler_requests_total{code="500"}                 promhttp_metric_handler_requests_total{code="500"}
promhttp_metric_handler_requests_total{code="503"}                 promhttp_metric_handler_requests_total{code="503"}
                                                                >  retention_count
storage_dicts_read                                                 storage_dicts_read
storage_dicts_write                                                storage_dicts_write
storage_dimensions_read                                            storage_dimensions_read
storage_dimensions_write                                           storage_dimensions_write
                                                                >  storage_reads_total
storage_segments_read                                              storage_segments_read
storage_segments_write                                             storage_segments_write
storage_trees_read                                                 storage_trees_read
storage_trees_write                                                storage_trees_write
                                                                >  storage_writes_total
write_back_count                                                   write_back_count
write_back_timer                                                   write_back_timer

eh-am added 3 commits August 12, 2021 19:23
remove all calls to metrics.Counter
and instead inline using the prometheus/promauto package
that's used to give more control and therefore
create unique registers per tests
avoiding the "duplicate metrics collector registration attempted" panic
message
see prometheus/client_golang#716 (comment)
@eh-am eh-am marked this pull request as draft August 12, 2021 22:38
@codecov
Copy link

codecov bot commented Aug 12, 2021

Codecov Report

Merging #335 (5dbc22d) into main (12f8317) will decrease coverage by 0.04%.
The diff coverage is 56.97%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #335      +/-   ##
==========================================
- Coverage   53.99%   53.96%   -0.03%     
==========================================
  Files         102      102              
  Lines        4775     4967     +192     
==========================================
+ Hits         2578     2680     +102     
- Misses       1958     2048      +90     
  Partials      239      239              
Impacted Files Coverage Δ
pkg/cli/server.go 7.90% <0.00%> (ø)
pkg/server/build.go 100.00% <ø> (ø)
pkg/server/controller.go 31.64% <ø> (-0.16%) ⬇️
pkg/server/handler.go 0.44% <0.00%> (-<0.01%) ⬇️
pkg/util/debug/cpu.go 0.00% <0.00%> (ø)
pkg/util/debug/reporter.go 0.00% <0.00%> (ø)
pkg/storage/periodic.go 75.44% <70.38%> (-1.64%) ⬇️
pkg/storage/cache/cache.go 55.89% <80.00%> (-1.26%) ⬇️
pkg/convert/profile_extra.go 85.30% <100.00%> (ø)
pkg/storage/storage.go 66.75% <100.00%> (+9.52%) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b32a38b...5dbc22d. Read the comment docs.

@petethepig
Copy link
Member

petethepig commented Aug 13, 2021

@eh-am This all looks great, thank you!

So ended up taking the registry as a dependency, as per prometheus/client_golang#716 (comment) 's suggestion which in consequence required updating the constructor in multiple places

Generally we try to avoid global state anyway, so this is great

One clear difference is that lazy instanciated metrics (such as evictions_count are now eagerly initiated with 0)

This is fine, I think it's actually better if they're eagerly initiated like that.

eh-am added 7 commits August 13, 2021 18:55
couple things:
1 - semantics have changed, ie the metrics are initialized to 0
even if we can't ever generate metrics data

so think in a dashboard a 0 value vs simply nothing showing up

2 - some of the metrics generated were dynamic, for example,
based on files in a directory
that has been hardcoded
@eh-am
Copy link
Collaborator Author

eh-am commented Aug 15, 2021

Hey, with the newest updates the metrics package is gone.

I've created a docker-compose in /hacks with grafana+prometheus+2 pyroscope instances + 2 go agents to help me compare the changes against a reference pyroscope instance, here's a screenshot of the dashboard (it's pretty raw).

screencapture-localhost-3000-d-4pHwiH77k-pyroscope-raw-metrics-comparison-2021-08-15-18_19_53

I think all this can be deleted, but I left it in the PR so that anybody can run and validate the refactored metrics are working as they should.

Apart from that, there are a few points worth mentioning:

  1. Moved everything out of the metrics package, so few things were inlined and copy pasted, which may feel weird.
    Some of them will probably go away in the future, for example our timing metrics expose nanoseconds, however prometheus docs says to use base units (seconds in this case, see https://prometheus.io/docs/instrumenting/writing_exporters/#naming). so all the * 1_000_000_000 calls would go away

Another example is the type conversion, which was previously centralized, which kinda suggested to pass interface {} vars and hope for the best. Therefore moving closer to the actual prometheus metrics call may make the required conversions clearer.

  1. I previously mentioned this but certain metrics semantics changed. For example, there were quite a few metrics that were lazily loaded, which means that they may never be initialized! For example https://github.com/pyroscope-io/pyroscope/blob/2723d116d03511afac40dc7cc581661d91cd606e/pkg/util/debug/cpu.go#L9-L18
    if the call to gopsutil continously failed, it would never be returned and therefore the metric would never exist.
    Now that we are eagerly initializing it, it's reported as 0. Which changes the semantics (it's not that it's using 0% of the CPU).
    Don't know how bad this would be.

  2. In a similar note, certain metrics that were very dynamic, such as the disk data structures (https://github.com/pyroscope-io/pyroscope/blob/2723d116d03511afac40dc7cc581661d91cd606e/pkg/util/debug/disk.go#L13-L23) are eagerly initialized to 0. I don't know if this dynamicity is actually useful to us, if so let me know so that I can rollback this part.

  3. Certain metrics weren't tested because I couldn't reproduce easily (think retention_count, evictions_count, retention_timer etc). It's in my screenshot above.

As a bonus, here's the output of promlinter list .

TYPE                NAME
GAUGE               benchmark
GAUGE               run_progress
COUNTER             upload_errors
COUNTER             successful_uploads
COUNTER             storage_writes_total
COUNTER             write_back_count
COUNTER             evictions_count
COUNTER             retention_count
COUNTER             storage_reads_total
GAUGE               evictions_alloc_bytes
GAUGE               evictions_total_bytes
GAUGE               evictions_used_perc
GAUGE               storage_caches_flush_timer
GAUGE               storage_badger_close_timer
GAUGE               evictions_timer
GAUGE               write_back_timer
GAUGE               retention_timer
GAUGE               cpu_utilization
GAUGE               disk_local_profiles
GAUGE               disk_main
GAUGE               disk_segments
GAUGE               disk_trees
GAUGE               disk_dicts
GAUGE               disk_dimensions
GAUGE               cache_dimensions_size
GAUGE               cache_segments_size
GAUGE               cache_dicts_size
GAUGE               cache_trees_size

@eh-am eh-am marked this pull request as ready for review August 15, 2021 22:00
@eh-am eh-am marked this pull request as draft August 18, 2021 18:35
@eh-am eh-am marked this pull request as ready for review August 18, 2021 19:04
Copy link
Member

@petethepig petethepig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks great, I will merge

@petethepig petethepig merged commit 673ae2b into grafana:main Aug 23, 2021
korniltsev pushed a commit that referenced this pull request Jul 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants