Decide how to handle "probably" relevant comparisons #983
Comments
Do you want to try to come up with a threshold for how many test case results with the same profile but small magnitudes should be considered "definitely" relevant?
I think that probably makes sense; it'll probably be worth iterating on, but maybe we can start by saying that something like 10 benchmarks pointing in the same direction from the same profile is enough to bump us to definitely? Not sure if 10 will feel right in the long run, but it feels like a good start.
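For concreteness, here is a minimal sketch of what such a "bump to definitely" rule could look like. The names, types, and the way results are grouped are hypothetical and not the actual rustc-perf code; it only illustrates the "N results from the same profile pointing the same direction" idea:

```rust
use std::collections::HashMap;

/// Hypothetical summary of one test case result for a single profile
/// (e.g. check/debug/opt); not the actual rustc-perf types.
struct TestResultDelta {
    benchmark: String,
    profile: String,
    /// Relative change in instructions:u; positive = regression.
    pct_change: f64,
}

/// Number of results, all moving the same direction within one profile,
/// that is enough to promote "probably" to "definitely".
const SAME_DIRECTION_THRESHOLD: usize = 10;

fn bump_to_definitely(results: &[TestResultDelta]) -> bool {
    // Count regressions and improvements separately, per profile.
    let mut counts: HashMap<&str, (usize, usize)> = HashMap::new();
    for r in results {
        let entry = counts.entry(r.profile.as_str()).or_insert((0, 0));
        if r.pct_change > 0.0 {
            entry.0 += 1;
        } else if r.pct_change < 0.0 {
            entry.1 += 1;
        }
    }

    // If any single profile has >= 10 results pointing the same way,
    // treat the comparison as definitely relevant.
    counts
        .values()
        .any(|&(regressed, improved)| {
            regressed >= SAME_DIRECTION_THRESHOLD || improved >= SAME_DIRECTION_THRESHOLD
        })
}
```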
https://perf.rust-lang.org/compare.html?start=b6057bf7b7ee7c58e6a39ead02eaa13b75f908c2&end=c02371c442f811878ab3a0f5a813402b6dfd45d2&stat=instructions:u is a similar case where the regression consistently shows up across the many-assoc-items benchmark -- all of check, debug, and opt are regressed. It was judged "probably" rather than "definitely", but it seems like it should have just been in the mixed category.
https://perf.rust-lang.org/compare.html?start=a8387aef8c378a771686878062e544af4d5e2245&end=b27661eb33c74cb514dba059b47d86b6582ac1c2&stat=instructions:u is another example -- await-call-tree shows several profiles regressing.
Did this catch your eye because all of the improvements are in incremental?
Hmm... this seems pretty borderline to me. I agree it's probably worth a look, but the categorization of "probably relevant" seems correct over "definitely relevant".
No, await-call-tree across 3 profiles. The incremental improvements are typically less 'interesting', particularly when looking at incr-unchanged benchmarks -- those just have less going on, so smaller changes are more likely to show up as significant.
Hm -- so maybe our understanding is different, but the way I see it, if we expect a human to go investigate, then it should be auto-categorized as definitely. If that investigation is short, fine, but to some extent right now we would have missed this result without the perf triage reports (if we were just limiting ourselves to perf-regression labels), which seems unfortunate.
Our understanding of these categories is definitely different. It seems you have a "when in doubt it should be in triage" perspective. The original purpose of these categories was to differentiate between comparisons where we were sure that a performance change had happened (and we just need a human to help investigate the cause and determine whether the code change in question is worth accepting the performance penalty) vs. changes where we're unsure (although fairly confident) whether an actual performance change has happened or whether it's just noise (and we need a human to first determine that this change is actually a performance change and not noise). This is why the names are "definitely relevant" and "probably relevant", which describe this distinction pretty well.

In particular, I originally and purposefully created a higher bar for triage than for perf-regression labels. This is why anything that is labeled as "probably relevant" or "definitely relevant" gets a perf-regression label, but only "definitely relevant" changes get included in triage. Triage was to be reserved for changes where we are sure that there is an actual performance change.

It sounds like perhaps you're advocating for getting rid of the "probably relevant" distinction (since we've gotten relatively good at filtering out noise) and that we should only make a distinction between "definitely relevant" and "maybe relevant", the former of which we label with perf-regression labels AND note in the triage report, and the latter of which we simply ignore (as we currently do with such cases). Thoughts?
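To make the current mapping concrete, here is a rough sketch of how the categories relate to labeling and triage as described above. The enum shape and function names are hypothetical, not the actual rustc-perf implementation:

```rust
/// Hypothetical relevance categories for a comparison.
enum Relevance {
    /// We're sure a real performance change happened; a human should
    /// investigate the cause and weigh the trade-off.
    Definitely,
    /// Fairly confident, but it could still be noise.
    Probably,
    /// Likely noise; currently ignored.
    Maybe,
}

/// Anything "probably" or above gets the perf-regression label on the PR.
fn gets_perf_regression_label(r: &Relevance) -> bool {
    matches!(r, Relevance::Definitely | Relevance::Probably)
}

/// Only "definitely" relevant comparisons are included in the triage
/// report, keeping the bar for triage higher than for labeling.
fn included_in_triage(r: &Relevance) -> bool {
    matches!(r, Relevance::Definitely)
}
```

The proposal at the end of the comment would amount to dropping the `Probably` variant and applying both the label and the triage mention to `Definitely` alone.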
Based on a conversation with @Mark-Simulacrum it sounds like we might want to do the following (please correct me if I'm misremembering something):
This comparison was categorized as probably relevant - https://perf.rust-lang.org/compare.html?start=7611fe438dae91084d17022e705bf64374d5ba4b&end=bcfd3f7e88084850f87b8e34b4dcb9fceb872d00&stat=instructions:u - but I think it is definitely so, due to the wide range of -doc benchmarks affected.