Use Airspeed Velocity for Regression Testing #35046
base: develop
Conversation
I'd still find this wonderful! |
src/sage/benchmark/__init__.py
Outdated
@@ -0,0 +1,128 @@ |
from sage.all import * |
FWIW: in #35049 (trivial PR moving some existing benchmark files to the same place) I used `sage.tests.benchmarks` instead of `sage.benchmark`. No opinion on which is best, but I guess we don't want to keep both.
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## develop #35046 +/- ##
===========================================
- Coverage 88.57% 88.57% -0.01%
===========================================
Files 2140 2141 +1
Lines 397273 397370 +97
===========================================
+ Hits 351891 351963 +72
- Misses 45382 45407 +25
☔ View full report at Codecov. |
This reverts commit 9fb7e1d.
it appears that you cannot send arbitrary amounts of data through a message queue in Python, see https://stackoverflow.com/questions/10028809/maximum-size-for-multiprocessing-queue-item. The queue will block pretty early, so we take load off the queue by using the local file system for some of the transfers. We still need to delete the files, which is not done in this proof of concept yet.
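For illustration, a minimal sketch of the workaround described above (the names and structure are mine, not the PR's actual code): the worker writes its large result to a temporary file and only sends the file name through the queue.

```python
# Illustrative sketch: pass large results via a temporary file instead of
# pushing them through a multiprocessing.Queue, which can block on big items.
import json
import multiprocessing
import tempfile

def worker(queue):
    result = {"timings": list(range(1_000_000))}  # stands in for a large payload
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(result, f)
        queue.put(f.name)  # only the short file name travels through the queue

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    process = multiprocessing.Process(target=worker, args=(queue,))
    process.start()
    path = queue.get()
    with open(path) as f:
        result = json.load(f)
    process.join()
    # As noted above, the temporary file is not cleaned up yet in this sketch.
```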
… can be resolved by get_benchmark_from_name() in asv.benchmark by making sure that a module.class.method lookup works, i.e., by not having any `.` in the wrong places.
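For illustration only (the helper below is made up, but it reproduces the benchmark names visible in the deployed demo): asv resolves a benchmark name by splitting it on `.` into module, class and method, so the dots inside the Sage module and method paths are replaced by the visually similar one-dot leader `․` (U+2024).

```python
# Illustrative sketch of the naming scheme: keep exactly two real dots
# (module.class.method) and turn all other dots into U+2024 ONE DOT LEADER.
DOT_LEADER = "\u2024"  # "․", looks like "." but is not split on by asv

def asv_name(sage_module, sage_class, sage_method, asv_module="docstring"):
    cls = sage_module.replace(".", DOT_LEADER)
    method = "track_" + f"{sage_class}.{sage_method}".replace(".", DOT_LEADER)
    return ".".join([asv_module, cls, method])

print(asv_name("sage.matrix.matrix0", "Matrix", "is_singular"))
# docstring.sage․matrix․matrix0.track_Matrix․is_singular
```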
I ran matrix/ and rings/ for all the releases between 9.7 and 9.8.rc1. The results are at https://www.imo.universite-paris-saclay.fr/~ruth/asv/html/. tl;dr: the output is already useful but also very noisy. We should improve the regression detection to get rid of much of the noise. If you go to the regressions view, there's a bunch of noise at the beginning where (I guess) just the rc1 was much slower than all the previous ones. So likely there was just load on the machine when these samples were taken (maybe we can tune asv's regression detection to not detect these?) [more noise]
The first benchmark that is not of this kind is [docstring.sage․rings․power_series_poly.track_PowerSeries_poly․truncate](https://www.imo.universite-paris-saclay.fr/~ruth/asv/html/#docstring.sage%E2%80%A4rings%E2%80%A4power_series_poly.track_PowerSeries_poly%E2%80%A4truncate?commits=047281e0-9116c558). This seems to be noise, however; at least I cannot reproduce it.
The next benchmark that is not of this kind is docstring.sage․rings․polynomial․multi_polynomial.track_MPolynomial․is_symmetric. However, you can see that this one has recovered already in the latest rc. (Upon closer inspection, this seems to also have been noise.)

Then there's docstring.sage․matrix․matrix0.track_Matrix․is_singular, and this I can finally reproduce. So there seems to be a speed regression here. I can also reproduce docstring.sage․rings․lazy_series.track_LazyPowerSeries_gcd_mixin․gcd; however, since this feature was only introduced in the 9.8 series, it's likely just a bugfix that makes the computation slower.

Some things look quite striking in the graphs, like docstring.sage․rings․polynomial․polynomial_ring.track_PolynomialRing_general․element_constructor, but I just cannot reproduce them. While this could be noise, I doubt it somewhat. It could simply be that the surrounding tests have changed, so due to some changed caching, the timings change at some point. To support this theory somewhat, I get the slow timings on a 9.7 build when just running this one doctest in a SageMath session.

PS: I note that printing of exceptions is slightly slower with 9.8.rc1 than with 9.7. So maybe this explains some of the noise in the doctest timings? I should maybe run all the tests again and see if I observe the same thing again. |
Very general question: what is actually compared, in particular, if the doctests are modified? |
The doctests are extracted (hover over a test and you should see the extracted doctest.) When the doctest changes, the results are not related anymore. Currently, the old results are not shown at all when this happens. We could probably change this behavior and still show them but label them differently. |
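One way to get exactly this invalidation with asv (whether the PR implements it this way is an assumption on my part) is the `version` attribute that asv attaches to every benchmark: results recorded under a different version are ignored, so hashing the extracted doctest makes old timings disappear as soon as the doctest text changes.

```python
# Sketch, assuming asv's documented ``version`` attribute: results stored under
# a different version string are ignored, so a modified doctest starts a fresh
# time series instead of being compared against the old one.
import hashlib

EXTRACTED_DOCTEST = "m = matrix(QQ, 2, 2, [1, 2, 3, 4]); m.is_singular()"  # illustrative

class TrackMatrixIsSingular:
    version = hashlib.sha256(EXTRACTED_DOCTEST.encode()).hexdigest()

    def track_runtime(self):
        # here one would time the extracted doctest; a constant stands in for that
        return 0.0
```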
I don't think that's very useful. We are going to have similar (or worse) amounts of noise when we use GitHub CI instead. So what we come up with should work for a very noisy runner.
I just extracted (the relevant bits of) |
Same picture: @mantepse And actually, I can reproduce this regression. I paste the following in a sage repl (the preceding doctests)
and then I call
This is on a different CPU than the one that created the graphs. So the timings are different. But the change is on a similar scale. |
So, one conclusion here could be that we need a little script to run all doctests up to some line to make it easier to "debug" these kinds of reports. |
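Such a script does not exist yet as far as I can tell; a rough sketch of the idea, using only the standard library's doctest parser (the function name and interface are made up, and real Sage docstrings would additionally need Sage's `sage:` preparser):

```python
# Rough sketch: run the doctest examples of an object up to a given line so the
# state right before a slow example can be reproduced and then timed by hand.
import doctest

def run_doctest_prefix(obj, stop_line, globs=None):
    """Execute the examples in ``obj.__doc__`` with line number below ``stop_line``
    and return the resulting globals."""
    globs = dict(globs or {})
    for example in doctest.DocTestParser().get_examples(obj.__doc__ or ""):
        if example.lineno >= stop_line:
            break
        exec(compile(example.source, "<doctest prefix>", "single"), globs)
    return globs
```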
Without any tricks, I can reproduce this slowdown (since @mantepse probably cares about species?) |
I tweaked the regression detection data a bit and updated the deployed demo at https://www.imo.universite-paris-saclay.fr/~ruth/asv/html/#regressions?sort=3&dir=desc Now the detected major regressions look legit to me (at least the graphs on the first page are mostly convincing.) |
Hm, on my box (ubuntu 22.04.1, but I guess that's irrelevant), I can only reproduce a slight difference in a fresh session: I get about 2.6ms vs 3.45ms. If I do the test (including the stuff that came before) again, the difference is gone. What puzzles me, however, is that this regression (if it is real) should be visible also in other doctests. Still, I'm not sure how relevant this is. I really think it would be better to have a few specially marked tests (i.e., [...]). In the case at hand, for example:
(and this won't work, because it exhibits the missing random generator, which is #34925) |
Yes, I was thinking more about older releases.
Same here. |
@mantepse did you run the test as I had posted it, i.e., with the |
For benchmarking explicitly, I propose to use asv the way it is meant to be used, i.e., just drop some benchmarks in the
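For readers who have not used asv: a standalone benchmark is just a Python file in the configured benchmark directory whose `setup`/`time_*`/`track_*` members asv discovers by name. A minimal illustrative example (file name and contents are made up):

```python
# Illustrative standalone asv benchmark; asv picks up ``setup`` and ``time_*``
# members automatically and times ``time_*`` methods repeatedly.
from sage.all import QQ, PolynomialRing

class PolynomialArithmetic:
    def setup(self):
        R = PolynomialRing(QQ, "x")
        self.f = R.random_element(degree=200)
        self.g = R.random_element(degree=200)

    def time_multiply(self):
        self.f * self.g
```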
No, I only run the normal doctests. This is also what is being run by the GitLab CI on each merge currently. |
The problem is that this disconnects code from tests. I would guess that we would be much more encouraged to write benchmark tests, if they are kept in the same source. Possibly one could also automate some for certain categories, but I think it would be great if we could start somehow and refine later. If benchmark tests take around 1 second, I suspect that noise would also be less of an issue.
Did you try to run the long doctests, or do they take too long? More explicitly, I propose to provide infrastructure for either a new doctest section or a marker [...]. I think we could specify that these tests should be independent of other tests and should, if at all possible, never be modified. What do you think? |
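To make the proposal concrete, here is roughly what such a marked doctest could look like (the `# benchmark` tag is purely a placeholder for whatever section or marker would be introduced):

```python
# Hypothetical benchmark-marked doctest; the "# benchmark" tag does not exist,
# it only illustrates the kind of stable, dedicated timing test proposed above.
def is_symmetric(self):
    r"""
    Return whether this polynomial is symmetric.

    TESTS::

        sage: R.<x,y> = QQ[]
        sage: p = sum((x + y)^k for k in range(50))
        sage: p.is_symmetric()  # benchmark
        True
    """
```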
I am a bit torn here, but I think we should try to just use the tools in the way they are meant to be used. We have some benchmarks in SageMath already and nobody seems to care about them, so that shows that we should maybe not create a [...]. A marker [...] |
I am mostly interested in avoiding performance regressions, because I spent quite some time fixing constructors that were unnecessarily slow. In particular, these were an obstacle for findstat; some, for example the constructor for posets, still are.

Thus, I would love to see a bot that runs on every ticket, or at least on every beta release, which reliably warns us of (major) regressions. Is this possible with the current approach? Are the timings always taken on the same machine, or is there some magic that accounts for different machines?

If we can run the long doctests on github, then maybe we do not need an extra marker - just mention in the developer's guide that these are used to detect performance regressions. However, I'm afraid that we will hit the problem that we cannot compare doctests which changed. Do I understand correctly that doctests for a method cannot be compared if even one of them was modified?

Here is yet another idea to avoid the problem of changing doctests and to get started more quickly. We could have a marker
Then (if this is possible at all), the bot would run this test with all binaries available which are younger than [...].

A prerequisite is, of course, that we can access pre-built binaries. Is this possible - difficult - easy?

To get started, we could add this marker to the [...].

This approach might have some advantages:
If this approach is easy to implement, there are some later optimizations possible:
|
I don't think this is possible reliably. Other projects have experimented with this in the past and, from the examples I know, are not using this setup anymore. Even if you run the benchmarks on the same machine every time, you need to go to some lengths to get reproducible timings (disable power management, make sure there's nobody else on the same hardware, so no virtual machines, make sure nothing else runs in parallel, so no parallel execution of tests, …)

In principle the timings are taken on very similar machines. The tests run on GitHub CI virtual machines that have, in my experience, very similar hardware. If the hardware changes a lot, one can detect the generation of the machine and set a [...]

However, to your [...]

Yes, if one line of the actual doctested content changes, the old timings are not shown anymore. I think this is the correct approach. I don't want to use doctests of actively developed code for benchmarking. We should write proper benchmarks for this. But I want to see that much of number theory got a bit slower because somebody changed a small detail of how |
Sorry, this is not easily possible. We don't even have all of the recent releases available as docker images or conda builds, and in some cases their dependencies have moved out of sync with how the SageMath distribution has updated them. We have no recent betas available as binaries in that format. |
@mezzarobba do you think that the regressions shown at https://www.imo.universite-paris-saclay.fr/~ruth/asv/html/#regressions?sort=3&dir=desc now look reasonable enough? |
I only see regressions in |
Yes, I only ran combinat. Let me run all of SageMath. I'll ping you tomorrow with the full output :) Edit: The machine I use to run these tests is currently not functioning properly (somebody spawned a job that ate all the RAM) so it might take some more time for the admins to take care of this first. |
Sorry for asking so many questions: if it is not possible to reliably detect major regressions, what is the aim of this ticket? |
I think we can reliably detect major regressions a while after they happened and we can then also detect where they happened. We can also get good hints at where a regression might have happened. For actual benchmarks in |
Concerning
and
I don't really understand.
Having such a benchmarking tool as a bot would be nice, but I don't think it is absolutely necessary. In fact, I think it would even be sufficient if it is run for every beta, and not for every ticket, if that's a problem. If it is less work for you: maybe it would be more efficient to have a video conference, so you could explain the problems to me in person? |
We have failed to find a maintainer for the docker images for the past few years, i.e., somebody to build binaries and make sure that the script used to build those binaries works somewhat reliably on some sort of CI hardware. (Part of the story was also that we couldn't find anybody to maintain that CI infrastructure …)

I am very reluctant to create any process that becomes more than just a tiny added step in our current setup, because it creates a burden to maintain. (We run the CI on the doctests anyway, so let's just export the data it collects. And while we parse that data, asv will also run the benchmarks that are explicitly marked as such.)
Sorry if that's not clear. With the changes in this PR, asv runs all benchmarks that are in [...]. If you want to write benchmarks that are implemented outside of
Sure. I can do today at 6pm Germany time or basically any time tomorrow. Let's see if we can agree on some time at https://sagemath.zulipchat.com/#narrow/stream/271079-doctest/topic/tracking.20performance.20regressions |
Documentation preview for this PR is ready! 🎉 |
Any news here? |
No, I had no time to work on this. (And had missed your question.) |
Looking forward to the invention of the time turner :-) |
📚 Description
…the changeset is probably not useful anymore. I just resurrected what was on #25262.
📝 Checklist
TODO
See comments in the source code. Additionally:
[...]
in the name. `docstring` as the module name fixed the import timings. However, now everything shows up as `docstring` in the benchmark grid :(