Increase in average collector duration of most of the collectors in recent versions #1658
To get a fair, comparable result, 0.25.0 should first be compiled with the current Go runtime, to see which part of the difference comes from the Go runtime itself. Secondly, I expect that most of the diff comes in between 0.28.0 and 0.29.0. In 0.29.0, I redesigned the HTTP handler for Prometheus. Before 0.29.0, time was counted once the collector was done: the collector wrote all its data to a global buffer of metrics, so much of the time was not counted into the collect time. In 0.29.0, each collector has its own buffer and the collector waits until the buffer is written into the Prometheus registry. This avoids a scenario where … Could you also post the …?

(closed, because I mis-clicked)
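A minimal sketch of what this change means for the measured duration (my own illustration, not the actual windows_exporter code; the names and the 10 ms delay are made up): with a large buffered channel the collector returns before the consumer has drained anything, while with a blocking hand-off the measured window also covers the time the consumer needs.

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// collect is a stand-in for a collector writing one metric to the channel.
func collect(ch chan<- prometheus.Metric) {
	desc := prometheus.NewDesc("demo_metric", "demo", nil, nil)
	ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, 1)
}

func main() {
	// Pre-0.29.0 style: the shared buffer is large enough that the collector
	// never blocks, so the measured duration excludes the time needed to
	// drain the buffer into the registry.
	buffered := make(chan prometheus.Metric, 1000)
	start := time.Now()
	collect(buffered)
	fmt.Println("buffered write, measured:", time.Since(start))

	// 0.29.0 style: the collector hands metrics over directly and we wait
	// until the consumer is finished, so the consumer's work is now inside
	// the measured window.
	unbuffered := make(chan prometheus.Metric)
	done := make(chan struct{})
	start = time.Now()
	go func() {
		for range unbuffered {
			time.Sleep(10 * time.Millisecond) // simulate a slow consumer
		}
		close(done)
	}()
	collect(unbuffered)
	close(unbuffered)
	<-done
	fmt.Println("blocking hand-off, measured:", time.Since(start))
}
```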
Good point! I didn’t think we could expect a “degradation” from the runtime. It goes from:
Do you think it’s worth it to release a “v0.25.2” with an updated framework? Or not worth the trouble?
I would not offer a full release, only a snapshot build from a PR. (Let me know if that is fine.) But please note my edits above.
Honestly, I was expecting something like that, because it explains why the change is so general. v0.29 is much faster globally because of the huge improvement in the service collector. I'm fine with the explanation you gave about the change in what is counted in the duration.
If you agree, I could offer a snapshot build which counts time directly after the scrape. That should be a fair comparison here. I'm happy if someone could take the time to do some testing in the same way as you are currently doing it, but I don't want to stress you out. Otherwise, feel free to close the issue; that's fine.
But in the future, I expect that scrape times will get higher. Currently, windows_exporter scrapes the Windows registry directly and tries to read the binary performance data. That is fast and works most of the time, but the code is very complicated. There is a hidden feature toggle where a collector uses the Win32 API to fetch the performance data instead (ref: #1350). Looking at the 0.29 release, only the cpu collector takes note of the feature flag. The general plan is that the PDH functions are used and enabled for the 1.0 release next year. But multiple syscalls are slower than fetching and reading the binary data from the Windows registry.
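For context, a rough sketch of the registry-based approach (my own illustration, not windows_exporter's actual code): query `HKEY_PERFORMANCE_DATA` for a perf object index and get the raw binary blob back in a single call. The object index `"230"` and the buffer sizes are example assumptions.

```go
//go:build windows

package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/windows"
)

// queryPerfData reads the raw performance data blob for the given object
// index string, growing the buffer until RegQueryValueEx stops reporting
// ERROR_MORE_DATA (the perf provider does not report the required size).
func queryPerfData(objectIndex string) ([]byte, error) {
	name, err := windows.UTF16PtrFromString(objectIndex)
	if err != nil {
		return nil, err
	}
	buf := make([]byte, 1<<16)
	for {
		size := uint32(len(buf))
		err := windows.RegQueryValueEx(windows.HKEY_PERFORMANCE_DATA, name, nil, nil, &buf[0], &size)
		if err == nil {
			return buf[:size], nil
		}
		if err != windows.ERROR_MORE_DATA {
			return nil, err
		}
		buf = make([]byte, len(buf)*2) // required size is not reported here; just grow
	}
}

func main() {
	defer windows.RegCloseKey(windows.HKEY_PERFORMANCE_DATA)

	data, err := queryPerfData("230") // example: 230 is the Process object index
	if err != nil {
		log.Fatal(err)
	}
	// Parsing the PERF_DATA_BLOCK / PERF_OBJECT_TYPE structures out of this
	// blob is the complicated part referred to above.
	fmt.Printf("got %d bytes of raw performance data\n", len(data))
}
```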
I saw the feature toggle, and from a production perspective the only metrics that are relevant are the total scrape time and the CPU/memory usage of the exporter. The calculation we do per collector is mostly there to detect a massive change in one collector, be it in the collector itself or a misconfiguration / change by our configuration manager. I'm fine with the current explanation and performance, especially since the service collector is now “fixed”. Reach out to me in the future if you want me to compare the performance of two releases.
Problem Statement
We are in the process of migrating our clients from the 0.25.* branch to the 0.29.* one, and I always check for differences in collector and perflib durations. This time the results are unexpected, as pretty much all collectors show a huge degradation. Looking at the changelogs, I don't understand where the difference is coming from.
Some context:
Currently our system can compare the average performance of the following values: `windows_exporter_collector_duration_seconds` for the following `collector` labels (cpu, iis, logical_disk, mssql, net, os, process, service, system, textfile), plus `windows_exporter_perflib_snapshot_duration_seconds`.
The current test sample is around 25 servers on `0.25.0` and the same number on `0.29.1`, across multiple versions of Windows Server 2016, 2019 and 2022 on multiple sites. We can increase the sample size to around 380 of each version if you want me to go further.
The result at the moment (average of all hosts' durations over 20m, with a scrape interval of 15s):
Even the `textfile` collector shows a significant increase in duration, and it is usually very stable (as it's one of the simplest). Could it be something in the exporter engine that affects all collectors?

Don't get me wrong, the performance is not by any means bad, and the improvement in the service collector is far better than the sum of the degradations. I am just trying to know whether it's expected or not.
Difference over time is stable:
I can compute more statistics if needed
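For reference, a minimal sketch of how the per-collector averages above could be pulled out of Prometheus programmatically (my own illustration, not part of the original issue; the server address, the `job="windows"` matcher and the 20m window are assumptions):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Average collector duration per collector over the last 20 minutes,
	// averaged across all hosts of a given fleet.
	query := `avg by (collector) (
	  avg_over_time(windows_exporter_collector_duration_seconds{job="windows"}[20m])
	)`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```

Running the same query against the 0.25.0 fleet and the 0.29.1 fleet gives directly comparable per-collector numbers.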
Environment