
Optimize PrometheusCollector scrape efficiency. #2974

Closed
wants to merge 24 commits

Conversation

bwplotka

@bwplotka bwplotka commented Oct 22, 2021

Fixes kubernetes/kubernetes#104459

This might end up as some reusable framework for CRI and others, but for now attempting to optimize cadvisor in a bespoke fashion.

CPU profile: https://share.polarsignals.com/e46845c/
Memory profile: https://share.polarsignals.com/9cb2d48/

Related to prometheus/client_golang#917

Signed-off-by: Bartlomiej Plotka bwplotka@gmail.com

@google-cla google-cla bot added the cla: yes label Oct 22, 2021
@k8s-ci-robot
Collaborator

Hi @bwplotka. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Creatone
Collaborator

/ok-to-test

@bwplotka
Author

Results so far:

```
./benchstat -delta-test=none ../_dev/client_golang/bench/out_base1.txt ../_dev/client_golang/bench/out_new3.txt
name                                         old time/op    new time/op    delta
PrometheusCollector_Collect/AlwaysUpdate-12    17.0ms ± 0%     5.2ms ± 0%  -69.32%

name                                         old alloc/op   new alloc/op   delta
PrometheusCollector_Collect/AlwaysUpdate-12    8.31MB ± 0%    0.95MB ± 0%  -88.58%

name                                         old allocs/op  new allocs/op  delta
PrometheusCollector_Collect/AlwaysUpdate-12      208k ± 0%       25k ± 0%  -87.85%

./benchstat -delta-test=none ../_dev/client_golang/bench/out_base1.txt ../_dev/client_golang/bench/out_new3.txt
name                                   old time/op    new time/op    delta
PrometheusCollector_Collect/Cached-12    17.0ms ± 0%     1.2ms ± 0%  -93.14%

name                                   old alloc/op   new alloc/op   delta
PrometheusCollector_Collect/Cached-12    8.31MB ± 0%    0.00MB ± 0%  -99.94%

name                                   old allocs/op  new allocs/op  delta
PrometheusCollector_Collect/Cached-12      208k ± 0%        0k ± 0%  -99.96%
```

NOTE: This is still missing some validity checks and single-flight on the HTTP handler for the full experience, but it shows how fast and efficient we can get this.

I already have an idea for a better abstraction that should be similarly efficient, if not better, while being more readable.
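For context, here is a minimal sketch (not code from this PR) of what single-flight on the scrape path could look like, using golang.org/x/sync/singleflight. The expensiveGather function is a made-up stand-in for a full cadvisor collection pass; the point is that concurrent scrapes share one collection result instead of each paying the full cost.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"

	"golang.org/x/sync/singleflight"
)

// gatherCount tracks how many expensive gathers actually ran.
var gatherCount int64

// expensiveGather stands in for a full (hypothetical) collection pass.
func expensiveGather() (string, error) {
	atomic.AddInt64(&gatherCount, 1)
	time.Sleep(50 * time.Millisecond) // simulate work
	return "encoded metrics payload", nil
}

func main() {
	var g singleflight.Group

	// Simulate three scrapers arriving at roughly the same time: they share
	// the result of a single gather instead of each triggering their own.
	done := make(chan struct{})
	for i := 0; i < 3; i++ {
		go func() {
			payload, _, _ := g.Do("scrape", func() (interface{}, error) {
				return expensiveGather()
			})
			fmt.Println(payload)
			done <- struct{}{}
		}()
	}
	for i := 0; i < 3; i++ {
		<-done
	}
	fmt.Println("gathers actually run:", atomic.LoadInt64(&gatherCount))
}
```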

@bwplotka
Author

This is for 20 containers, 5k metrics.


@smarterclayton
Contributor

Very nice. There are definitely a lot of very large collectors out there that transform from one representation to another, where series changes are a small fraction of the total set on each scrape. It's encouraging to see the discussion moving towards a pattern for that (apiserver in kube, haproxy/other-proxy -> source format).

```go
for _, cont := range containers {
	values := make([]string, 0, len(rawLabels))
	labels := make([]string, 0, len(rawLabels))
	values := values[:0]
```
Author

Thanks to the fact that I talked about this (https://twitter.com/bwplotka/status/1452971992324390918?t=Ufm9UR3YGf-ud8JimO-tZw&s=19), Ivan from the audience spotted that this should reuse the existing variable instead of creating another one! ((:
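To illustrate the suggestion (this is not the PR's code; containers and rawLabels below are made-up inputs): allocate the slice once, then reset its length with s = s[:0] so the backing array is reused across iterations instead of calling make inside the loop.

```go
package main

import "fmt"

func main() {
	containers := []string{"cont-a", "cont-b", "cont-c"} // hypothetical input
	rawLabels := []string{"name", "image", "namespace"}

	// Allocate once, outside the loop, with enough capacity for one
	// container's label values...
	values := make([]string, 0, len(rawLabels))

	for _, cont := range containers {
		// ...and reset the length (not the capacity) on each iteration, so
		// the backing array is reused and no per-container allocation happens.
		values = values[:0]
		for _, l := range rawLabels {
			values = append(values, cont+"/"+l) // stand-in for real label lookup
		}
		fmt.Println(values)
	}
}
```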

@derekwaynecarr
Collaborator

@bwplotka this is very promising to see the delta!

```go
// * returns cached results if until time expires.
//
// It implements prometheus.rawCollector.
func (c *PrometheusCollector) Collect() []*dto.MetricFamily {
```
Collaborator

This causes a compilation failure against Kubernetes:

```
+++ [1028 13:00:49] Building go targets for linux/amd64:
    ./pkg/kubelet
# k8s.io/kubernetes/pkg/kubelet/server
pkg/kubelet/server/server.go:383:50: cannot use "k8s.io/kubernetes/vendor/github.com/google/cadvisor/metrics".NewPrometheusCollector(prometheusHostAdapter{...}, containerPrometheusLabelsFunc(s.host), includedMetrics, "k8s.io/kubernetes/vendor/k8s.io/utils/clock".RealClock{}) (type *"k8s.io/kubernetes/vendor/github.com/google/cadvisor/metrics".PrometheusCollector) as type prometheus.Collector in argument to r.RawMustRegister:
	*"k8s.io/kubernetes/vendor/github.com/google/cadvisor/metrics".PrometheusCollector does not implement prometheus.Collector (wrong type for Collect method)
		have Collect() []*io_prometheus_client.MetricFamily
		want Collect(chan<- prometheus.Metric)
```

Author

Right. This is because we need to use this collector with the new transactional promhttp handler (promhttp.HandlerForTransactional), similar to how I implemented it in cadvisor. Looks like this has to be changed in the kubelet too.
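For context, a rough sketch (not from this PR) of how a collector that returns decoded []*dto.MetricFamily can be served through the transactional path in client_golang v1.12+. The cadvisorAdapter type here is hypothetical; prometheus.TransactionalGatherer and promhttp.HandlerForTransactional are the client_golang pieces the thread refers to.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	dto "github.com/prometheus/client_model/go"
)

// cadvisorAdapter is a hypothetical wrapper around a collector that already
// produces decoded metric families, exposing them through the transactional
// gatherer interface instead of the channel-based prometheus.Collector one.
type cadvisorAdapter struct {
	collect func() []*dto.MetricFamily // e.g. a Collect() []*dto.MetricFamily method
}

// Compile-time check that the adapter satisfies the transactional interface.
var _ prometheus.TransactionalGatherer = cadvisorAdapter{}

// Gather returns the families plus a done callback; a cached collector can
// use done() to release a read lock on its cache once the response is written.
func (a cadvisorAdapter) Gather() ([]*dto.MetricFamily, func(), error) {
	return a.collect(), func() {}, nil
}

func main() {
	adapter := cadvisorAdapter{collect: func() []*dto.MetricFamily { return nil }}
	http.Handle("/metrics/cadvisor",
		promhttp.HandlerForTransactional(adapter, promhttp.HandlerOpts{}))
	// http.ListenAndServe(":8080", nil) // left out of this sketch
}
```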

Collaborator

I did a bunch of digging and the fix won't be quite as trivial or straightforward as it was in this PR. This is because in k8s, interaction with the Prometheus client libraries goes through the component-base metrics libraries, and it's not clear to me what the best way forward is. I pinged @logicalhan to get some feedback.

xref:

Annoyingly, while it would be nice to stop using k8s' deprecated RawMustRegister, it looks like that is still used in a couple of other places, so we can't just get rid of it and define something better for this use case:

https://cs.k8s.io/?q=RawMustRegister&i=nope&files=&excludeFiles=&repos=kubernetes/kubernetes

@dgrisonnet

@bwplotka I wonder if we could still make some performance improvements here by using a notification-based approach for updating the cache. As far as I understand, your current approach is a session-based one that regenerates all the metrics during each collection, but more efficiently than before because of the cache. However, we can expect that most of the container stats will not change between two collections, so we could save a bit of CPU by only updating the cache whenever a change in a container occurs. I am not knowledgeable about the cadvisor codebase, so I don't know whether that would be feasible, but as far as I can tell, container watchers already exist in the codebase:

cadvisor/manager/manager.go, lines 267 to 268 (at 0549d48):

```go
containerWatchers []watcher.ContainerWatcher
eventsChannel     chan watcher.ContainerEvent
```

This is the approach that kube-state-metrics currently takes: it watches events on Kubernetes objects and updates its metrics cache only when an object is created/updated/deleted.

So my suggestion would be to have two different kinds of cached collectors in client_golang. The first would be the session-based one that you already proposed in prometheus/client_golang#929, which is very suitable for projects that don't have the possibility of watching for events. If node_exporter isn't using inotify watches on sysfs/procfs, it might be a potential user of this kind of collector. The other kind would be even more performant, but would require the existence of a notification system and would update the cache directly whenever an event requiring a metrics update is detected. This approach would have direct access to the cache and would lock it during collection, so it wouldn't have to keep two different stores as the session-based approach currently does.
Mixing both approaches could perhaps also be an option?

Let me know what your thoughts are on that; these are just some ideas I am throwing in.
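Purely as an illustration of the notification-driven variant described above (none of this is from the thread; containerEvent and the per-container keying are invented for the sketch, only the dto types are real client_model types):

```go
package main

import (
	"fmt"
	"sync"
	"time"

	dto "github.com/prometheus/client_model/go"
)

// containerEvent is a made-up stand-in for cadvisor's watcher.ContainerEvent.
type containerEvent struct {
	name string
}

// notifiedCache keeps pre-built metric families per container and only
// rebuilds the entries touched by an event, instead of regenerating all
// metrics on every scrape as the session-based approach does.
type notifiedCache struct {
	mu    sync.RWMutex
	cache map[string][]*dto.MetricFamily
}

func newNotifiedCache() *notifiedCache {
	return &notifiedCache{cache: map[string][]*dto.MetricFamily{}}
}

// run consumes container events and updates only the affected entries.
func (c *notifiedCache) run(events <-chan containerEvent, build func(string) []*dto.MetricFamily) {
	for ev := range events {
		mfs := build(ev.name) // regenerate metrics for this container only
		c.mu.Lock()
		if mfs == nil {
			delete(c.cache, ev.name) // container went away
		} else {
			c.cache[ev.name] = mfs
		}
		c.mu.Unlock()
	}
}

// Collect returns the cached families under a read lock, so scrapes never
// pay for metric regeneration.
func (c *notifiedCache) Collect() []*dto.MetricFamily {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make([]*dto.MetricFamily, 0, len(c.cache))
	for _, mfs := range c.cache {
		out = append(out, mfs...)
	}
	return out
}

func main() {
	c := newNotifiedCache()
	events := make(chan containerEvent, 1)
	go c.run(events, func(name string) []*dto.MetricFamily {
		n, help, typ := name, "example family", dto.MetricType_GAUGE
		return []*dto.MetricFamily{{Name: &n, Help: &help, Type: &typ}}
	})
	events <- containerEvent{name: "cont-a"}
	time.Sleep(10 * time.Millisecond) // crude synchronization, fine for a sketch
	fmt.Println(len(c.Collect()), "cached families")
}
```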

metrics/prometheus.go (outdated, resolved)
Comment on lines 1808 to 1820
```go
errorsGauge := 0
if err := c.collectVersionInfo(session); err != nil {
	errorsGauge = 1
	klog.Warningf("Couldn't get version info: %s", err)
}
if err := c.collectContainersInfo(session); err != nil {
	errorsGauge = 1
	klog.Warningf("Couldn't get containers: %s", err)
}

// Describe describes all the metrics ever exported by cadvisor. It
// implements prometheus.PrometheusCollector.
func (c *PrometheusCollector) Describe(ch chan<- *prometheus.Desc) {
	c.errors.Describe(ch)
	for _, cm := range c.containerMetrics {
		ch <- cm.desc([]string{})
session.MustAddMetric(
	"container_scrape_error", "1 if there was an error while getting container metrics, 0 otherwise",
	nil, nil, prometheus.GaugeValue, float64(errorsGauge), nil,
```
Collaborator

Spent some time today trying to get this working properly in k8s.

I am finding that I keep running into errors... If I don't set IdType: cadvisorv2.TypeName in the cadvisor options, I get

```
ehashman@fedora:~/src/k8s$ curl http://127.0.0.1:8001/api/v1/nodes/127.0.0.1/proxy/metrics/cadvisor
# HELP cadvisor_version_info A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor revision.
# TYPE cadvisor_version_info gauge
cadvisor_version_info{cadvisorRevision="",cadvisorVersion="",dockerVersion="",kernelVersion="5.10.0-9-amd64",osVersion="Debian GNU/Linux 11 (bullseye)"} 1
# HELP container_scrape_error 1 if there was an error while getting container metrics, 0 otherwise
# TYPE container_scrape_error gauge
container_scrape_error 1
```

and an error in the logs:

```
W1124 16:45:01.711801 1279370 prometheus.go:1815] Couldn't get containers: invalid request type ""
```

If I do set it, I get

```
ehashman@fedora:~/src/k8s$ curl http://127.0.0.1:8001/api/v1/nodes/127.0.0.1/proxy/metrics/cadvisor
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "error trying to reach service: EOF",
  "reason": "ServiceUnavailable",
  "code": 503
}
```

Not sure what I'm doing wrong.

k8s WIP PR is kubernetes/kubernetes#106334

And the patch I'm applying on top of the PR to get to that second error case is

```diff
diff --git a/pkg/kubelet/server/server.go b/pkg/kubelet/server/server.go
index ce3fc942e49..be9cb5e2603 100644
--- a/pkg/kubelet/server/server.go
+++ b/pkg/kubelet/server/server.go
@@ -378,7 +378,16 @@ func (s *Server) InstallDefaultHandlers() {
                includedMetrics[cadvisormetrics.AcceleratorUsageMetrics] = struct{}{}
        }
 
-       br.MustRegisterRaw(metrics.NewPrometheusCollector(prometheusHostAdapter{s.host}, containerPrometheusLabelsFunc(s.host), includedMetrics, clock.RealClock{}))
+       cadvisorOpts := cadvisorv2.RequestOptions{
+               IdType:    cadvisorv2.TypeName,
+               Count:     1,
+               Recursive: true,
+       }
+
+       containersCollector := metrics.NewPrometheusCollector(prometheusHostAdapter{s.host}, containerPrometheusLabelsFunc(s.host), includedMetrics, clock.RealClock{})
+       containersCollector.SetOpts(cadvisorOpts)
+       br.MustRegisterRaw(containersCollector)
+
        br.MustRegister(metrics.NewPrometheusMachineCollector(prometheusHostAdapter{s.host}, includedMetrics))
        s.restfulCont.Handle(cadvisorMetricsPath,
                promhttp.HandlerForTransactional(br, promhttp.HandlerOpts{ErrorHandling: promhttp.ContinueOnError}),
```

@bwplotka
Author

Updated for the newest client_golang changes (prometheus/client_golang#929)

Should be ready for review. There are some observations for further improvements, but otherwise all races should be fixed and we should see significant improvements in allocations and CPU usage.

TODO still:

  • Fix a few failing tests.
  • Benchmark
  • Run with the kubelet and measure / check it functionally.

cc @ehashman @dgrisonnet

FYI: I have to take two weeks off for post-knee-surgery recovery, but anyone is welcome to continue this!

@bwplotka bwplotka marked this pull request as ready for review January 27, 2022 11:34
@bwplotka
Author

As @dgrisonnet said, moving to a pure watcher-based approach would be extremely beneficial, but it needs huge changes, so I'm keeping that out of scope for now.

@ehashman
Collaborator

I have updated my k/k PR https://github.com/kubernetes/kubernetes/pull/107960/files#diff-c3e7f724d1b0dcee40df80716ed57d90d2649710150fa92bf9822bdad35e0429, which integrates this change plus the required changes to rebase on top of runc 1.1.0.

I still am not getting any metrics... any time I try to hit the kubelet endpoint, I get an empty response and a corresponding logline of:

```
I0210 15:37:17.477351  121888 httplog.go:131] "HTTP" verb="GET" URI="/metrics/cadvisor" latency="91.027µs" userAgent="curl/7.74.0" audit-ID="" srcIP="127.0.0.1:51322" resp=0
```

@bwplotka
Author

Ack @ehashman taking a look today/tomorrow (:

```
/tmp/GoLand/___BenchmarkPrometheusCollector_Collect_in_github_com_google_cadvisor_metrics.test -test.v -test.paniconexit0 -test.bench ^\QBenchmarkPrometheusCollector_Collect\E$ -test.run ^$ -test.benchtime 10s
goos: linux
goarch: amd64
pkg: github.com/google/cadvisor/metrics
cpu: Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz
BenchmarkPrometheusCollector_Collect
BenchmarkPrometheusCollector_Collect-12    	     748	  14929127 ns/op	 8253796 B/op	  207706 allocs/op
PASS

Process finished with the exit code 0
```

@bwplotka
Author

This PR is ready to go IMO.

All benchmarked. With the kubelet we see significantly fewer allocations and lower CPU usage (kubernetes/kubernetes#108194).

@bwplotka
Author

cc @ehashman @dgrisonnet

@bwplotka
Author

On the kubelet side I see only one discrepancy between this PR and without this PR:

[screenshot omitted]

Worth double checking before merging. Otherwise no blocker.

@bwplotka
Author

TODO: Clean up the commits.

dto "github.com/prometheus/client_model/go"
)

//
Author

uncomment


Done in a587512.

@tossmilestone

@bwplotka hi, is this PR still in progress? We have encountered the problem as well.

@bwplotka
Author

bwplotka commented May 1, 2024

Wow, sorry for the lag. This work was paused and went stale. I have changed jobs since then, so I'm closing it for now. Efficiency improvements are still very possible here, but ideally we would redesign cadvisor's metric tracking, and that is a bigger piece of work.

Something to discuss with cadvisor maintainers cc @cwangVT @SergeyKanzhelev

@bwplotka bwplotka closed this May 1, 2024

Successfully merging this pull request may close these issues.

Steady state kubelet CPU usage is high due to excessive allocation in prometheus scraping of cadvisor