
Increase in average collector duration of most of the collectors in recent versions #1658

Closed
JDA88 opened this issue Sep 30, 2024 · 7 comments


@JDA88
Contributor

JDA88 commented Sep 30, 2024

Problem Statement

We are in the process of migrating our clients from the 0.25.* branch to the 0.29.* one. I always check for differences in collector and perflib durations, and this time the results are unexpected: pretty much all collectors show a large degradation. Looking at the changelogs, I don't understand where the difference is coming from.

Some context:
Our system currently compares the averages of the following values:
windows_exporter_collector_duration_seconds for the collectors cpu, iis, logical_disk, mssql, net, os, process, service, system, textfile
+
windows_exporter_perflib_snapshot_duration_seconds

The current test sample is around 25 servers on 0.25.0 and the same number on 0.29.1.
Multiple versions of Windows Server 2016, 2019 and 2022 across multiple sites.
We can increase the sample size to around 380 of each version if you want me to go further.
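
For reference, the per-collector averages shown below can be reproduced with a query along these lines. This is only a minimal Go sketch using the Prometheus HTTP API client; the server address and the aggregation labels are assumptions, not our exact setup.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical Prometheus address; adjust for your environment.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Average per-collector duration over a 20m window, averaged across all
	// hosts. Filter or group by exporter version on top of this to get the
	// two columns compared in the table below.
	query := `avg by (collector) (avg_over_time(windows_exporter_collector_duration_seconds[20m]))`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```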

The results at the moment (average duration across all hosts over 20m, with a scrape interval of 15s):

| Collector | 0.25.0 | 0.29.1 | Delta | Detail |
|---|---|---|---|---|
| service | 107ms | 3.08ms | ~ -97% | Massive improvement with #1584 |
| logical_disk | 0.26ms | 1.37ms | ~ +400% | Maybe #1442? |
| system | 0.24ms | 1.19ms | ~ +380% | |
| net | 0.28ms | 1.27ms | ~ +350% | |
| cpu | 0.34ms | 1.31ms | ~ +300% | |
| mssql | 0.80ms | 1.97ms | ~ +150% | Small sample for the moment |
| iis | 0.81ms | 1.87ms | ~ +120% | |
| perflib | 7.57ms | 13.8ms | ~ +80% | |
| textfile | 1.52ms | 2.59ms | ~ +70% | I see no reason for such a degradation |
| process | 1.11ms | 1.72ms | ~ +50% | #1472, #1592 |
| os | 1.69ms | 2.74ms | ~ +65% | |

Even the textfile collector shows a significant increase in duration, and it is usually very stable (as it is one of the simplest). Could it be something in the exporter engine that affects all collectors?

Don’t get me wrong, the performance is not bad by any means, and the improvement in the service collector far outweighs the sum of the degradations. I am just trying to find out whether this is expected or not.

Difference over time is stable:
[screenshot]

I can compute more statistics if needed

Environment

  • windows_exporter Version: 0.25.0 & 0.29.1
  • Windows Server Version: 2016, 2019 and 2022
@jkroepke
Member

jkroepke commented Sep 30, 2024

To get a fair, comparable result, 0.25.0 should first be compiled with the current Go runtime to see which part of the difference comes from the Go runtime itself.

Secondly, I expect that most of the difference comes from changes between 0.28.0 and 0.29.0.

In 0.29.0, I redesigned the HTTP handler for Prometheus. Before 0.29.0, time was only counted until the collector was done; the collector wrote all the data to a global buffer of metrics, so much of the time was not counted towards the collect time.

In 0.29.0, each collector has its own buffer and the collector waits until the buffer is written into the Prometheus registry. This avoids a scenario where collector_duration_seconds was sometimes missing.
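
To illustrate where the extra measured time comes from, here is a minimal sketch (not the actual windows_exporter code) of which span the timer covers in the two designs:

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// collect stands in for any windows_exporter collector writing metrics to a channel.
func collect(ch chan<- prometheus.Metric) {
	ch <- prometheus.MustNewConstMetric(
		prometheus.NewDesc("example_metric", "stand-in metric", nil, nil),
		prometheus.GaugeValue, 1,
	)
}

func main() {
	// Pre-0.29.0 style: the timer stops as soon as the collector function
	// returns; forwarding the globally buffered metrics is not timed.
	oldBuf := make(chan prometheus.Metric, 100)
	start := time.Now()
	collect(oldBuf)
	oldSpan := time.Since(start)
	close(oldBuf)
	for range oldBuf {
		// forwarding to the registry happens outside the timed span
	}

	// 0.29.0 style: each collector has its own buffer and the timer keeps
	// running until everything has been handed over to the registry
	// (simulated here by draining the channel before stopping the clock).
	newBuf := make(chan prometheus.Metric, 100)
	start = time.Now()
	collect(newBuf)
	close(newBuf)
	for range newBuf {
		// the registry consumes the metrics inside the timed span
	}
	newSpan := time.Since(start)

	fmt.Println("old timed span:", oldSpan, "new timed span:", newSpan)
}
```

The work done is the same in both cases; only the boundary of what is counted towards collector_duration_seconds moves.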

Could you also post scrape_duration_seconds for 0.25 and 0.29, which covers the overall scrape time? Since 0.25 used a global buffer and needed many more mutexes and locks, I would expect a lower overall scrape time in 0.29.

(closed because of a misclick)

@jkroepke closed this as not planned on Sep 30, 2024
@jkroepke reopened this on Sep 30, 2024
@JDA88
Contributor Author

JDA88 commented Sep 30, 2024

> To get a fair, comparable result, 0.25.0 should first be compiled with the current Go runtime to see which part of the difference comes from the Go runtime itself.

Good point! I didn’t think we could expect a “degradation” from the runtime. It goes from:

0.25.0 => go_info{version="go1.21.5"}
0.29.1 => go_info{version="go1.23.1"}

Do you think it’s worth releasing a “v0.25.2” built with an updated Go runtime? Or is it not worth the trouble?

@jkroepke
Member

jkroepke commented Sep 30, 2024

I would not offer a full release, only a snapshot build from a PR (let me know if that is fine). But please note my edits above.

@JDA88
Contributor Author

JDA88 commented Sep 30, 2024

> In 0.29.0, I redesigned the HTTP handler for Prometheus. Before 0.29.0, time was only counted until the collector was done; the collector wrote all the data to a global buffer of metrics, so much of the time was not counted towards the collect time.

Honestly, I was expecting something like that, because it explains why the change is so general.

> Could you also post scrape_duration_seconds for 0.25 and 0.29, which covers the overall scrape time? Since 0.25 used a global buffer and needed many more mutexes and locks, I would expect a lower overall scrape time in 0.29.

v0.29 is much faster globally because of the huge improvement in the service collector:
[screenshot]

It also uses much less CPU:
[screenshot]

I'm fine with the explanation you gave about the change in what is counted in the duration.
Tell me if you want me to collect more data; if not, you can close the question.

@jkroepke
Member

If you agree, I could offer a snapshot build which counts the time directly after the scrape. That should be a fair comparison here.

I'm happy if someone could take the time to do some testing in the same way as you are currently doing it, but I don't want to stress you out. Otherwise, feel free to close the issue. That's fine.

@jkroepke
Member

But in the future, I expect that the scrape times will get higher.

Currently, windows_exporter scrapes the Windows registry directly and reads the binary performance data. That's fast and works most of the time, but the code is very complicated.

There is a hidden feature toggle where the collector uses the Win32 API to fetch the performance data (ref: #1350). Looking at the 0.29 release, only the cpu collector takes note of the feature flag. The general plan is that the PDH functions will be used and enabled for the 1.0 release next year.

But multiple syscalls are slower than fetching and reading the binary data from the Windows registry.
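
For context, the registry-based path boils down to reading a raw binary blob from HKEY_PERFORMANCE_DATA and decoding it by hand. A rough sketch of just the read, assuming golang.org/x/sys/windows/registry (the object index "238" for Processor is only an example):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/windows/registry"
)

// readPerfData reads the raw perflib snapshot for one object index from
// HKEY_PERFORMANCE_DATA. The returned blob still has to be decoded by hand,
// which is the complicated part of the current approach.
func readPerfData(objectIndex string) ([]byte, error) {
	buf := make([]byte, 1<<16)
	for {
		n, _, err := registry.PERFORMANCE_DATA.GetValue(objectIndex, buf)
		if err == registry.ErrShortBuffer {
			// HKEY_PERFORMANCE_DATA does not report the required size,
			// so keep growing the buffer until the snapshot fits.
			buf = make([]byte, len(buf)*2)
			continue
		}
		if err != nil {
			return nil, err
		}
		return buf[:n], nil
	}
}

func main() {
	data, err := readPerfData("238") // 238 = Processor object index (example)
	if err != nil {
		panic(err)
	}
	fmt.Printf("got %d bytes of raw perflib data\n", len(data))
}
```

The PDH route replaces the manual decoding with documented Win32 calls (PdhOpenQuery, PdhCollectQueryData, and so on), which is much simpler code but needs more syscalls per scrape, hence the expectation of somewhat higher scrape times.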

@JDA88
Contributor Author

JDA88 commented Sep 30, 2024

I saw the feature toggle. From a production perspective, the only metrics that are relevant are the total scrape time and the CPU/memory usage of the exporter. The per-collector calculation we do is mostly there to detect a massive change in one collector, be it from the collector itself or from a misconfiguration / change in our configuration manager.

I'm fine with the current explanation and performance, especially since the service collector is now “fixed”.

Reach out to me in the future if you want me to compare the performance of two releases.
Unfortunately, it’s impractical for us to test snapshot builds on more than a few hosts.
