
Increase in average collector duration of most of the collectors in recent versions #1658

Closed
JDA88 opened this issue Sep 30, 2024 · 7 comments


@JDA88
Contributor

JDA88 commented Sep 30, 2024

Problem Statement

We are in the process of migrating our clients from the 0.25.* branch to the 0.29.* one. I always check for differences in collector and perflib durations, and this time the results are unexpected: pretty much all collectors show a large degradation. Looking at the changelogs, I don't understand where the difference is coming from.

Some context:
Our system currently compares the averages of the following values:
windows_exporter_collector_duration_seconds for the collectors cpu, iis, logical_disk, mssql, net, os, process, service, system, textfile
+
windows_exporter_perflib_snapshot_duration_seconds

The current test sample is around 25 servers on 0.25.0 and the same number on 0.29.1.
Multiple versions of Windows Server 2016, 2019 and 2022 across multiple sites.
We can increase the sample size to around 380 of each version if you want me to go further.
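
For reference, the per-collector averages shown below can be reproduced with a query along these lines. This is only a minimal Go sketch using the Prometheus HTTP API client; the server address and the aggregation labels are assumptions, not our exact setup.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical Prometheus address; adjust for your environment.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Average per-collector duration over a 20m window, averaged across all
	// hosts. Filter or group by exporter version on top of this to get the
	// two columns compared in the table below.
	query := `avg by (collector) (avg_over_time(windows_exporter_collector_duration_seconds[20m]))`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```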

The results at the moment (average duration across all hosts over 20m, with a scrape interval of 15s):

| Collector | 0.25.0 | 0.29.1 | Delta | Detail |
|---|---|---|---|---|
| service | 107ms | 3.08ms | ~ -97% | Massive improvement with #1584 |
| logical_disk | 0.26ms | 1.37ms | ~ +400% | Maybe #1442? |
| system | 0.24ms | 1.19ms | ~ +380% | |
| net | 0.28ms | 1.27ms | ~ +350% | |
| cpu | 0.34ms | 1.31ms | ~ +300% | |
| mssql | 0.80ms | 1.97ms | ~ +150% | Small sample for the moment |
| iis | 0.81ms | 1.87ms | ~ +120% | |
| perflib | 7.57ms | 13.8ms | ~ +80% | |
| textfile | 1.52ms | 2.59ms | ~ +70% | I see no reason for such a degradation |
| process | 1.11ms | 1.72ms | ~ +50% | #1472, #1592 |
| os | 1.69ms | 2.74ms | ~ +65% | |

Even the textfile collector shows a significant increase in duration, and it is usually very stable (as it is one of the simplest). Could it be something in the exporter engine that affects all collectors?

Don’t get me wrong, the performance is not bad by any means, and the improvement in the service collector far outweighs the sum of the degradations. I am just trying to find out whether this is expected or not.

Difference over time is stable:
[screenshot]

I can compute more statistics if needed

Environment

  • windows_exporter Version: 0.25.0 & 0.29.1
  • Windows Server Version: 2016, 2019 and 2022
@jkroepke
Member

jkroepke commented Sep 30, 2024

To get a fair, comparable result, 0.25.0 should first be compiled with the current Go runtime to see which part of the difference comes from the Go runtime itself.

Secondly, I expect that most of the difference comes from changes between 0.28.0 and 0.29.0.

In 0.29.0, I redesigned the HTTP handler for Prometheus. Before 0.29.0, time was only counted until the collector was done; the collector wrote all the data to a global buffer of metrics, so much of the time was not counted towards the collect time.

In 0.29.0, each collector has its own buffer and the collector waits until the buffer is written into the Prometheus registry. This avoids a scenario where collector_duration_seconds was sometimes missing.
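
To illustrate where the extra measured time comes from, here is a minimal sketch (not the actual windows_exporter code) of which span the timer covers in the two designs:

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// collect stands in for any windows_exporter collector writing metrics to a channel.
func collect(ch chan<- prometheus.Metric) {
	ch <- prometheus.MustNewConstMetric(
		prometheus.NewDesc("example_metric", "stand-in metric", nil, nil),
		prometheus.GaugeValue, 1,
	)
}

func main() {
	// Pre-0.29.0 style: the timer stops as soon as the collector function
	// returns; forwarding the globally buffered metrics is not timed.
	oldBuf := make(chan prometheus.Metric, 100)
	start := time.Now()
	collect(oldBuf)
	oldSpan := time.Since(start)
	close(oldBuf)
	for range oldBuf {
		// forwarding to the registry happens outside the timed span
	}

	// 0.29.0 style: each collector has its own buffer and the timer keeps
	// running until everything has been handed over to the registry
	// (simulated here by draining the channel before stopping the clock).
	newBuf := make(chan prometheus.Metric, 100)
	start = time.Now()
	collect(newBuf)
	close(newBuf)
	for range newBuf {
		// the registry consumes the metrics inside the timed span
	}
	newSpan := time.Since(start)

	fmt.Println("old timed span:", oldSpan, "new timed span:", newSpan)
}
```

The work done is the same in both cases; only the boundary of what is counted towards collector_duration_seconds moves.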

Could you also post scrape_duration_seconds for 0.25 and 0.29, which covers the overall scrape time? Since 0.25 used a global buffer and needed many more mutexes and locks, I would expect a lower overall scrape time in 0.29.

(closed because of a misclick)

@jkroepke closed this as not planned on Sep 30, 2024
@jkroepke reopened this on Sep 30, 2024
@JDA88
Contributor Author

JDA88 commented Sep 30, 2024

> To get a fair, comparable result, 0.25.0 should first be compiled with the current Go runtime to see which part of the difference comes from the Go runtime itself.

Good point! I didn’t think we could expect a “degradation” from the runtime. It goes from:

0.25.0 => go_info{version="go1.21.5"}
0.29.1 => go_info{version="go1.23.1"}

Do you think it’s worth releasing a “v0.25.2” built with an updated Go runtime? Or is it not worth the trouble?

@jkroepke
Member

jkroepke commented Sep 30, 2024

I would not offer a full release, only a snapshot build from a PR (let me know if that is fine). But please note my edits above.

@JDA88
Contributor Author

JDA88 commented Sep 30, 2024

> In 0.29.0, I redesigned the HTTP handler for Prometheus. Before 0.29.0, time was only counted until the collector was done; the collector wrote all the data to a global buffer of metrics, so much of the time was not counted towards the collect time.

Honestly, I was expecting something like that, because it explains why the change is so general.

> Could you also post scrape_duration_seconds for 0.25 and 0.29, which covers the overall scrape time? Since 0.25 used a global buffer and needed many more mutexes and locks, I would expect a lower overall scrape time in 0.29.

v0.29 is much faster globally because of the huge improvement in the service collector:
[screenshot]

It also uses much less CPU:
[screenshot]

I'm fine with the explanation you gave about the change in what is counted in the duration.
Tell me if you want me to collect more data; if not, you can close the question.

@jkroepke
Member

If you agree, I could offer a snapshot build which counts the time directly after the scrape. That should be a fair comparison here.

I'm happy if someone could take the time to do some testing in the same way as you are currently doing it, but I don't want to stress you out. Otherwise, feel free to close the issue. That's fine.

@jkroepke
Member

But in the future, I expect that the scrape times will get higher.

Currently, windows_exporter scrapes the Windows registry directly and reads the binary performance data. That's fast and works most of the time, but the code is very complicated.

There is a hidden feature toggle where the collector uses the Win32 API to fetch the performance data (ref: #1350). Looking at the 0.29 release, only the cpu collector takes note of the feature flag. The general plan is that the PDH functions will be used and enabled for the 1.0 release next year.

But multiple syscalls are slower than fetching and reading the binary data from the Windows registry.
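
For context, the registry-based path boils down to reading a raw binary blob from HKEY_PERFORMANCE_DATA and decoding it by hand. A rough sketch of just the read, assuming golang.org/x/sys/windows/registry (the object index "238" for Processor is only an example):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/windows/registry"
)

// readPerfData reads the raw perflib snapshot for one object index from
// HKEY_PERFORMANCE_DATA. The returned blob still has to be decoded by hand,
// which is the complicated part of the current approach.
func readPerfData(objectIndex string) ([]byte, error) {
	buf := make([]byte, 1<<16)
	for {
		n, _, err := registry.PERFORMANCE_DATA.GetValue(objectIndex, buf)
		if err == registry.ErrShortBuffer {
			// HKEY_PERFORMANCE_DATA does not report the required size,
			// so keep growing the buffer until the snapshot fits.
			buf = make([]byte, len(buf)*2)
			continue
		}
		if err != nil {
			return nil, err
		}
		return buf[:n], nil
	}
}

func main() {
	data, err := readPerfData("238") // 238 = Processor object index (example)
	if err != nil {
		panic(err)
	}
	fmt.Printf("got %d bytes of raw perflib data\n", len(data))
}
```

The PDH route replaces the manual decoding with documented Win32 calls (PdhOpenQuery, PdhCollectQueryData, and so on), which is much simpler code but needs more syscalls per scrape, hence the expectation of somewhat higher scrape times.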

@JDA88
Contributor Author

JDA88 commented Sep 30, 2024

I saw the feature toggle. From a production perspective, the only metrics that are relevant are the total scrape time and the CPU/memory usage of the exporter. The per-collector calculation we do is mostly there to detect a massive change in one collector, be it from the collector itself or from a misconfiguration / change in our configuration manager.

I'm fine with the current explanation and performance, especially since the service collector is now “fixed”.

Reach out to me in the future if you want me to compare the performance of two releases.
Unfortunately, it’s impractical for us to test snapshot builds on more than a few hosts.
