Interleaved benchmark execution (multiple commands) #21
Comments
I'd be interested in putting this functionality into the tool. I guess we'd take it in as an option and have to change how the execution state is currently being tracked. Unless you were of the opinion that this would be the new default way to run timings. |
I would like to discuss this first before we put this into action. In principle, the rationale behind this feature sounds sensible. However, we also have a few features that already work in a similar direction:
Adding a
|
Also, simplicity is a desirable goal for me. If there are command line options that we don't necessarily need, I would tend to leave them out. |
That's true, the benefits are minimal and the downsides/problems presented may not be worth it. While an interesting idea in general, it may not be a correct approach for this tool. If there are already enough other facilities available in the existing feature set, this should probably be closed and we can move on to the other issues. |
I'm benchmarking some code that interacts heavily with the filesystem and OS, and sorely missing this functionality. Instantaneous load spikes are one thing, since they can be removed by outlier detection or at least detected by a high variance. But there are also many ways in which the state of the system can vary slowly over time. For example:
When system state changes slowly enough, it won't show up in the form of outlier runs; it will just bias the results in favor of whichever parameters were tested under more favorable conditions. While not a panacea, interleaving would greatly mitigate this effect. |
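The bias described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the linear slowdown model, the `drift_per_run` value, and the two hypothetical commands `A` and `B` with identical true runtimes are all made up for illustration, and this is not how hyperfine schedules runs.

```python
from statistics import mean

def measure(schedule, true_ms, drift_per_run=0.002):
    """Return the mean measured runtime per command for a given run order.

    Assumption: the system gets slower by a fixed fraction with every run
    (a crude stand-in for thermal throttling or cache/filesystem drift).
    """
    samples = {name: [] for name in true_ms}
    for i, name in enumerate(schedule):
        slowdown = 1.0 + drift_per_run * i
        samples[name].append(true_ms[name] * slowdown)
    return {name: mean(values) for name, values in samples.items()}

# Two hypothetical commands that are, in truth, equally fast.
true_ms = {"A": 10.0, "B": 10.0}
runs = 50

sequential = ["A"] * runs + ["B"] * runs   # all of A, then all of B
interleaved = ["A", "B"] * runs            # A, B, A, B, ...

seq = measure(sequential, true_ms)
inter = measure(interleaved, true_ms)

# Ratio of B's measured mean to A's; 1.0 would be the unbiased answer.
seq_bias = seq["B"] / seq["A"]
inter_bias = inter["B"] / inter["A"]
print(f"sequential:  B/A = {seq_bias:.4f}")
print(f"interleaved: B/A = {inter_bias:.4f}")
```

Under this toy model the sequential schedule makes `B` look noticeably slower than `A`, while the interleaved schedule spreads the drift almost evenly over both commands.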
Thank you very much for the detailed explanation. All of this sounds reasonable. How this strategy could be used in hyperfine is not completely clear to me (see "disadvantages" above), but I am opening this for further discussion. |
I find myself in the thermal throttling on a laptop camp as well. I also have other cases that don't make sense — for example, a recent test run I did was:
My "workaround" is to try and scale up the warmup and test iterations in a (futile?) attempt to try to heat-saturate the processor so that the before and after binaries run under similar throttled conditions. Another potential solution for this particular case would be to run tests like:
And ensure that the A->B and B->A runs report "close enough" to each other. |
I think we should try to find a solution for that and document it here in the repository. I'm not yet convinced that the proposal in this ticket is the best solution (it might be). Alternative solutions that come to my mind right now:
Could you go into more detail? These numbers are pretty much meaningless without the standard deviation. Could it be just noise or is this reproducible? Are you sure that all caching effects are eliminated (either via warmup runs or by resetting the caches with |
I'd rather not... but I will 😢 In that specific case, I was neglecting to build my binary correctly. I was doing That being said, that's an example of some of the variance I get between three test runs of 10 or 25 iterations (I forget which at this point). |
Ok, but do you consider this to be a problem? That's like a ~5% variation in runtime. What did |
Not much to add here but just to say I'd also be happy to have this option. I've been benchmarking a very IO heavy program (reading & writing GBs of data) and the benchmarks are very unreliable from one run to the next (even though I'm clearing disk caches on every iteration). Which version of the program is run first influences the results a lot. |
If this is a consistent result, I assume that some other form of caching is going on somewhere. |
True. But if it's, say, an SSD's internal SLC cache, I don't think there's any way to clear it other than just waiting. Even if you do wait, both the filesystem's and the SSD's internal data structures are going to end up in different states than they started in, which might affect performance, albeit probably not by much. |
Hi, Just realized that I accidentally made a duplicate of this issue and wanted to post some of my context here. I have run into issues with benchmarks that are very sensitive to minor condition changes, but have a clear speed difference when run correctly. For example, I have done some benchmarking of compiling rather small files and also doing linking. That does a solid amount of file IO and will always have a decent amount of deviation. I have found that in some of these cases, if the benchmarks are run non-interleaved, I will have to do something like I have also seen similar cases when measuring certain hot loop code that has rather minor performance optimizations. If I simply run the scripts non-interleaved, and don't do it for long enough, the first script run will almost always be the faster one. To my knowledge, this is simply due to the CPU heating up and thermal throttling. Something I was running earlier today, I started testing with I guess as an extra note, the other option to enable shorter runs while getting decent results is to fully reboot the machine and make sure not to open apps that could interfere with benchmarking. This works, but is a lot of extra effort to deal with something that is mostly fixed by interleaving. Hopefully these comments help to present a clear picture of why this feature could be extremely useful. PS: I run ZFS even on a laptop and it can really lead to less consistency in benchmarks that have any sort of file io, even with prepare dropping caches. |
Oh, one extra note: I think standard error is much more important than standard deviation in these cases. For example, if you have a script that takes 20 milliseconds to run with a standard deviation of 5 ms, and another script that only takes 18 ms, but the standard deviation is also 5 ms, looking at mean + standard deviation will never tell you which is faster. That being said, after enough runs the standard error will trend towards zero, making it clear that the 18 ms script is actually faster. We may want to also expose standard error, because it will help to properly distinguish whether the difference was just due to noise and regular deviation, or if the results were meaningfully different. |
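To make the 20 ms vs. 18 ms example concrete, here is a minimal sketch of how the standard error of the mean shrinks with the number of runs. It assumes independent, roughly normally distributed runs with the same standard deviation for both scripts; the helper names `standard_error` and `separation` are made up for this illustration and are not part of hyperfine.

```python
import math

def standard_error(stddev, n):
    """Standard error of the mean for n independent runs."""
    return stddev / math.sqrt(n)

def separation(mean_a, mean_b, stddev, n):
    """Difference of the two means, in units of its own standard error.

    Assumes both commands have the same stddev and the same number of runs,
    so the standard error of the difference is sqrt(2) * stddev / sqrt(n).
    """
    se_diff = math.sqrt(2) * standard_error(stddev, n)
    return abs(mean_a - mean_b) / se_diff

# 20 ms vs. 18 ms, both with a standard deviation of 5 ms.
for n in (10, 100, 1000):
    se = standard_error(5.0, n)
    z = separation(20.0, 18.0, 5.0, n)
    print(f"n={n:5d}  standard error={se:.3f} ms  separation={z:.2f}")
```

The standard deviation stays at 5 ms no matter how many runs are taken; only the standard error, and with it the ability to tell the two means apart, improves with more runs.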
Thanks for bringing this up. I have thought about this often, and I was never sure which would be better to show. Both values are very useful. The problem is that most users probably don't know the difference very well. But you are absolutely right, of course. A command with 18 ± 5 ms can be faster than 20 ± 5 ms in a statistically significant way, but there is no way to tell from the current output. We do, however, ship a We also have a But I'm also open to discussions whether or not this should be in the default output of hyperfine. |
It would be great if you could perform such a benchmark in a naive kind of way and share the results. Something like
I would love to take a closer look at the results and the progression of the runtimes. |
Ok, so here is an example of a command that timing can be a nightmare for. So this command over a very long term average, will run in about 18ms. If you look at the sample timing that I posted, the timing average until line 646 is approximately 18ms, but after line 646, the average looks to be closer to 12ms. I am almost 100% certain the timing issues are related to the amount of mmapping and file io done, but I am specifically trying to time the command with a warm cache. This command is essentially never used with a cold cache and those timings are not helpful. When timing this command against other variants of itself, or similar commands, they will generally run into the same sorts of crazy timing jumps. Generally speaking, when interleaved, both versions get timing peaks and valleys at the same time. Hopefully this gives enough context even if it isn't exactly what you asked for. |
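A shift like the one described (about 18 ms early on, about 12 ms later) can be checked by comparing the first and second half of the recorded run times. The sketch below assumes the timings were exported with hyperfine's `--export-json` option, whose output contains a `results` list with a `times` array in seconds; the inline sample data is invented to mimic the described jump and is not the actual recording.

```python
import json
from statistics import mean, median

def split_halves(times):
    """Split a list of run times into first and second half."""
    mid = len(times) // 2
    return times[:mid], times[mid:]

# Invented sample in the shape of a hyperfine JSON export (times in seconds).
sample = """
{"results": [{"command": "example",
              "times": [0.018, 0.019, 0.018, 0.017,
                        0.012, 0.012, 0.013, 0.011]}]}
"""

data = json.loads(sample)
summary = {}
for result in data["results"]:
    first, second = split_halves(result["times"])
    summary[result["command"]] = (mean(first), mean(second))
    print(result["command"])
    print(f"  first half:  mean={mean(first) * 1e3:.1f} ms  median={median(first) * 1e3:.1f} ms")
    print(f"  second half: mean={mean(second) * 1e3:.1f} ms  median={median(second) * 1e3:.1f} ms")
```

For a real recording, replace `json.loads(sample)` with `json.load(open(path))` for the exported file.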
I just accidentally recorded an even better example. I wanted to test 3 versions of an executable against a different command. I expected hyperfine would run: Anyway, the different command took:
Depending on which run I look at, it could be faster or about the same speed as most of the other commands I was testing. Hopefully my wording made sense here. |
Very interesting. Thank you very much. Here is how a histogram of the first benchmark looks like ( Could you please tell us a bit more about the command you are running and maybe the system that you are running it on (if possible)? What kind of background processes are running (browser, spotify, dropbox)? Also related: #481 |
Thanks for pointing out the 4 peaks. I am used to this having 2 peaks but not 4. Turns out that a daemon I had running in the background was turning turbo on and off. That is what was leading to the much larger distribution swing. That also makes sense as to why interleaving would stabilize the issue. Turbo mode would be on for plenty long enough that all the commands would execute multiple times before turning back off. So all commands would see the same highs and lows from turbo mode. Anyway, this still leaves me with 2 peaks that cause problems when comparing multiple versions of this command. I just collected a much longer trace with turbo mode always on. This recording was done, in performance mode, with turbo always on, after a restart (and minor delay), with the internet off, and only a terminal + htop open. I scanned through all of the running processes, and nothing looks more interesting than what would be found in a normal Ubuntu install. Generally speaking, the terminal windows had the highest CPU usage at about 2% because this app is so short running and does a lot of file io. Extra note about my setup, I run ZFS as my file system, which I am sure creates a more complex caching and file io scenario, but I am not sure if it is the root cause for the deviation. For more context on the command, this is a compiler and linker running. So the base process is about what would be expected: load source files, process them until you have assembly, write that to the file system, do linking to get an executable. The language is a bit weird because it embeds into a binary compiled by another language. Due to this property, we have our own custom linker that surgically appends our output object on to an existing binary updating what is needed. 
As such, we do a lot of file io and memory mapping: Copy preprocessed binary to output location, mmap it, load a metadata file, mmap our output bytes, copy the bytes over, do a bunch of random updates across the file to make sure it is ready, and finally dump it to disk. We also can run with a traditional linker. These are the timings from running with a regular linker. Looking at these timings, they are much much smoother (though maybe still bimodal?). The regular linker actually does much cleaner file io due to not mmapping and writing to disk linearly after building in memory. Given how much more bimodal the surgical linker is, I would posit that all of the memory mapping and potentially random file io is causing the clear peaks, but that is really just a guess. |
Very interesting. This is how all 10,000 benchmark runs (except for a few slow outliers outside of the y-axis range) look like in sequence: We can clearly see the bimodal distribution. We can also see that the pattern stays constant over time. Except for a few interferences(?) around n=1550 and n=8700. For comparison, this is how the regular linker benchmark looks like: And for completeness, here are the two histograms: Given that the bimodal pattern stays constant over time, I actually don't really see why an interleaved benchmark execution would help here. Maybe that's just the "true" runtime distribution (for whatever reason)? |
I see, so there's a lot going on here 😄. I'm not sure if we can trust non-robust estimators like mean and standard deviation here. The outliers have too large of an effect: here is a moving average view of all benchmark runs with a window size of num_runs = 1000 (note that I'm excluding the boundary regions here): The plateaus in the data come from outliers way outside the visible range: So these three bursts of outliers could also lead to the imbalance between the first and the second half in terms of mean run time. If we look at robust estimators (median and median absolute deviation / MAD), the picture looks a bit different:
Using MAD/sqrt(N) as a robust estimator for the "standard error", we then find:
which still indicates a (statistically significant) difference of 0.1 ms between first and second half. |
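For reference, the robust estimators mentioned above can be computed as in the following sketch. It assumes the plain (unscaled) median absolute deviation, since the thread does not say whether a normal-consistency factor was applied, and the sample timings are invented to show the effect of a single large outlier.

```python
import math
from statistics import mean, median, stdev

def mad(times):
    """Median absolute deviation from the median (unscaled)."""
    center = median(times)
    return median([abs(t - center) for t in times])

def robust_summary(times):
    """Return (median, MAD, MAD / sqrt(N)) for a list of run times."""
    spread = mad(times)
    return median(times), spread, spread / math.sqrt(len(times))

# Invented timings in ms: six ordinary runs and one extreme outlier.
times = [12.0, 12.1, 11.9, 12.2, 11.8, 12.0, 250.0]

med, spread, robust_se = robust_summary(times)
print(f"mean   = {mean(times):.2f} ms, stddev = {stdev(times):.2f} ms")
print(f"median = {med:.2f} ms, MAD = {spread:.2f} ms, MAD/sqrt(N) = {robust_se:.3f} ms")
```

The single outlier drags the mean and standard deviation far away from the typical run, while the median and MAD barely move.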
Wow! Thanks for the detailed analysis. It is really helpful to see. I guess this leaves me with the question: what is the correct plan of action given all of this information?
|
Interleave and use the minimum measurement. :) |
I don't think that would work well. I think my example falls into the same category as "networking" in the talk. The effects of file io matter to the performance. Just taking the minimum would remove a lot of the characteristics of the program that are extremely important. But interesting idea for certain classes of benchmarks. Also, I don't think interleaving matters if you are just taking the minimum. |
If you are doing I/O on a filesystem backed by a physical storage device, you are pulling in a number of very complex systems into the benchmark. I would suggest testing on an isolated tmpfs.
I agree that taking the minimum doesn't make a lot of sense if the tested code is nondeterministic. One such cause of nondeterminism is state which outlasts the benchmarked program's execution, which would include the state of the filesystem (in RAM and on disk), OS mechanisms like caching layers, and any backing physical devices. I think such a benchmark would overall just not be very meaningful, unless each invocation started from a clean state (i.e. the filesystem is created and mounted from scratch every time).
It can matter in case of a lag spike which would cover one benchmarked command's non-interleaved execution. |
I definitely agree that can help, but I don't think that is a solution I want to take. The issue is that the complexities of the file system are a huge part of the performance here. If those complexities are removed, the results are not accurate. Linking is essentially all file io. Yes, it does some other stuff and may even generate the final binary in memory before dumping, but there is no substantial CPU work. As such, taking advantage of a tmpfs can totally distort the performance picture. |
If you would like to include measuring the performance of the filesystem implementation as part of your benchmark, I would suggest to use a clean filesystem backed by RAM, fresh for every test. If you would like to also include the physical storage device as part of your benchmark, I would suggest performing an ATA reset, and TRIM / ATA secure erase (or the NVMe equivalents) for flash devices, for every test. Otherwise, each iteration depends on the previous iteration's state, plus any undefined initial state, plus interference with all other processes which interact with these systems. This just doesn't make for a very meaningful measurement - the results will be effectively unreproducible. |
So I decided to dig into this for my application just to better understand what really causes the performance variations. Turns out that there are 3 core distributions to the execution speed of my applications:
To my surprise, running on a tmpfs backed by memory vs on zfs backed by ssd made very minimal differences to any of these distributions. These results make me really wish syncing and dropping caches wasn't so slow. I think that "lightly" cached scenario would be a good target for general benchmarking of this app, but that takes astronomically longer to run than not clearing caches at all only to record a slightly cleaner result. I might just have to do some sort of processing of the data to remove one of the peaks. |
Thank you for the feedback on this.
How many runs do you perform? I would hope that 10-100 would be enough for reasonable statistics. Especially if you "control" the caching behavior. Not >1000 like you did for the full histograms above. Do you know what the reason for the "lightly" and "heavily" cached scenarios is? Why do we see the "lightly"-cached peak in the histograms at all? Shouldn't we mostly see the "heavily" cached scenario? |
So, did some more digging and found the cause of the bimodal distribution. It actually isn't caused by read caching at all. It is caused by synchronously flushing an mmap to disk. Apparently that is bimodal. If I change from a synchronous flush at the end of linking to an asynchronous flush, the bimodal distribution goes away. This sadly doesn't quite seem to be a full fix in my case because it adds a weird behavior where there is about a 1 in 1000 chance that the execution will freeze for about 4 seconds. I assume this is due to the app trying to close before the mmap has finished flushing. No idea why that would freeze for so long, but probably a kernel problem of some sort. |
Even when the mmap is of a RAM-backed file? That's interesting, maybe you could dump |
So, I just had the realization that on my current machine
This looks to also be correct. I added Thanks a ton for helping me dig through this. Really goes to show how many things can mess up benchmarking. |
I'm also in the laptop thermal throttle camp. I have an ugly little workaround I use to avoid this issue. It works by using named pipes in the setup/prepare/cleanup functions to coordinate two
# process "a"
hyperfine -r10 -s 'mkfifo a b || true; cat a' -p '>b; cat a' 'sleep 0.5'
# process "b" (from same working dir as above)
hyperfine -r10 -p '>a; cat b' -c '>a' 'sleep 0.2'
It only works if the number of iterations is fixed and equal on both sides. There are probably nicer/fancier ways to do this, but the above solution works for me. |
I think I changed my mind on this. This is a feature that should be supported natively by hyperfine. |
@sharkdp I would love to see this feature. I have a particular laptop which always runs about 20% faster for 10 minutes or so, then settles down to the baseline. I've been unable to "fix" this by BIOS settings or performance settings in the operating system, after many attempts. This appears not to be a "turbo" artifact but a thermal limit. Other than specifying a long sleep as a cleanup command (which allows the CPU to cool down and makes testing take forever), alternating the tests between command 1 and command 2 would produce fairer results. Alternatively, some sort of option for cleaning the data would be useful, but would probably be too easy to "abuse". For example, automatic removal of outliers by calculating upper and lower bounds at N standard deviations from the mean of the tested values (where a useful N is probably somewhere between 1.2 and 5, configurable by the user), assuming an otherwise normal distribution. |
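The outlier removal suggested above could look roughly like the following sketch. It is not an existing hyperfine option; the function name `remove_outliers`, the parameter `n_sigma`, and the sample timings are all hypothetical.

```python
from statistics import mean, stdev

def remove_outliers(times, n_sigma=2.0):
    """Keep only the run times within n_sigma standard deviations of the mean.

    Assumes the non-outlier runs are roughly normally distributed.
    """
    center = mean(times)
    spread = stdev(times)
    lower = center - n_sigma * spread
    upper = center + n_sigma * spread
    return [t for t in times if lower <= t <= upper]

# Invented timings in ms: nine ordinary runs and one slow outlier.
times = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 30.0]

cleaned = remove_outliers(times, n_sigma=2.0)
print(f"kept {len(cleaned)} of {len(times)} runs, mean = {mean(cleaned):.2f} ms")

# With a wider bound the outlier survives, because it inflates the very
# standard deviation that is used to detect it.
lenient = remove_outliers(times, n_sigma=3.0)
print(f"kept {len(lenient)} of {len(times)} runs with n_sigma=3.0")
```

The second call shows one reason such a filter is sensitive to the choice of N: a single extreme value widens the bounds enough to hide itself.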
CyberShadow on HN: