Make CI more efficient #10704

Closed
cbrnr opened this issue Jun 3, 2022 · 7 comments

@cbrnr (Contributor) commented Jun 3, 2022

Over the years, our CI jobs have grown in both number and run time. While high test coverage is essential, I have the feeling that many of our tests could be optimized in the sense that currently we're doing a lot of inefficient extra work. I haven't looked at concrete examples yet, but the usual practice is to add new tests for new features or bug fixes, often without considering whether the new behavior could be checked more efficiently by modifying an existing test.

Many of our jobs run for almost an hour, which means developers have to wait at least that long before merging. In addition, every subsequent push triggers the whole CI pipeline again. This also means that we're consuming a significant amount of power, which in times of the climate crisis is something we should try to avoid.

Could we use this issue to discuss (1) whether we agree that we should try to minimize CI run time, and (2) if so, to collect ideas for how we could achieve this?

Some things that come to mind immediately:

  1. Cancel all running jobs if another push is made to a PR (not sure if this is already the case, or if running jobs keep running until they are done).
  2. Determine on a very high level if we have redundant jobs. I currently do not have a good knowledge of why we need all these jobs on different platforms (Azure, CircleCI, GitHub Actions, ...), but we should make sure that we're not running any duplicate tests.
  3. Not sure how much influence downloading and installing mne and all its dependencies has on CI run time, but I assume we are already caching this? Same for downloading our testing data; if we don't have to do it every time we set up a new test run, this would save a lot of time.
  4. Maybe we could have people start with draft PRs that do not trigger CI? Or just a reduced subset such as style and/or only the tests that are directly affected by the changes?

IMO the long-term goal should be a thorough inspection of our test suite to optimize, merge, remove, and change tests.

@larsoner (Member) commented Jun 3, 2022

Indeed, this has always been a goal, and we periodically work on it. I suspect you don't have repo-level notifications on, otherwise you'd know that caching/speed PRs already happen fairly frequently :) (there have been quite a few in the last ~1.5 years alone).

At the moment I unfortunately don't see much room for improvement, outside of one idea about conda caching below. Specifically, here's why:

I have the feeling that many of our tests could be optimized in the sense that currently we're doing a lot of inefficient extra work. I haven't looked at concrete examples yet

Feel free to look. We output the timings -- and even added our own module/folder-level timings summarizer -- specifically to help figure out what could be optimized. I'd say once or twice a year I make a PR to reduce test time by looking at exactly the sorts of things you point out here. It's likely I've missed some, but I hope/think I've gotten rid of most of the obvious big blockers.

Many of our jobs run for almost an hour, which means developers have to wait at least that long before merging.

Indeed, but I don't find this too onerous.

I wonder if part of this frustration with slowness could be alleviated by changes in development practice (which we should maybe emphasize more in the dev docs). Feeling like the CI process takes too long can be a symptom of an inefficient development workflow, where a contributor sees a failure, makes a change locally, and immediately pushes to see if it makes CIs happy. Instead, tests should be run locally as much as possible before that push; e.g., targeted pytest mne/io/... mne/tests/..., make flake, python tutorials/..., and maybe even make -C doc html_dev-noplot or PATTERN=... make -C doc html_dev-pattern already tell you a lot about whether or not CIs will succeed.

Working this way greatly decreases both development time (less time waiting for CIs) and CI run time (energy consumption) by reducing the number of unnecessary commits/waits. I'm not sure if everyone already works this way, but I think we should emphasize that people should rely on their own local testing first, and think of CIs as the "final, complete check" once everything works locally. For me, using this approach, the CI time rarely feels like an issue (really only around release time!) -- on my PRs I expect everything to come back green based on my local testing, and when it doesn't, it's usually because I forgot a decorator or forgot to run make flake (which becomes apparent within minutes), and then the next push is good 90% of the time. So it doesn't slow down the flow very much, for me at least.

There are of course times I do the "suggest changes" and "just push" without testing, but in those cases I'm not surprised/disappointed when I have to wait another 1h CI cycle, because I usually could have avoided it with a tiny bit of local testing.

  1. Cancel all running jobs if another push is made to a PR (not sure if this is already the case, or if running jobs keep running until they are done).

This should already be implemented for all CIs.
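For reference, a minimal sketch of how this typically looks in a GitHub Actions workflow file (not necessarily our exact settings; Azure and CircleCI have their own equivalents):

```yaml
# Cancel any still-running jobs of this workflow for the same branch/PR
# when a new push arrives, so superseded commits stop consuming CI time.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```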

  2. Determine on a very high level if we have redundant jobs. I currently do not have a good knowledge of why we need all these jobs on different platforms (Azure, CircleCI, GitHub Actions, ...), but we should make sure that we're not running any duplicate tests.

Jobs are highly redundant, but in what I consider to be the right way. We run all tests (except ones that are too slow) on:

  • Linux, macOS, and Windows
  • Various Python versions from 3.7 through 3.10
  • Prerelease dependencies
  • Installs with full dependencies, with minimal dependencies, and with the oldest supported dependencies

Each of these checks routinely finds (different types of!) issues over time, and we've tried to deduplicate to the extent possible (e.g., by not using a full matrix of all combinations of these possibilities, but rather subsets).
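As a rough illustration of "subsets rather than a full matrix" in GitHub Actions terms (the job name, versions, and dependency labels below are made up for the example, not our actual configuration):

```yaml
# Hypothetical matrix: list a handful of OS/Python/dependency combinations
# explicitly instead of running the full cross-product of all of them.
jobs:
  pytest:
    strategy:
      matrix:
        include:
          - {os: ubuntu-latest, python: "3.10", deps: full}
          - {os: macos-latest, python: "3.9", deps: minimal}
          - {os: windows-latest, python: "3.7", deps: oldest}
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v3
      # ... set up Python ${{ matrix.python }}, install the "${{ matrix.deps }}"
      # flavor of dependencies, then run pytest
```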

  3. Not sure how much influence downloading and installing mne and all its dependencies has on CI run time, but I assume we are already caching this? Same for downloading our testing data; if we don't have to do it every time we set up a new test run, this would save a lot of time.

Yes, Guillaume did some work on this; feel free to search PRs if you're interested. We have testing dataset caching, but it only saves a little bit of time (maybe 1 minute?). We have pip caching, and it also saves a little bit of time (maybe another minute?). IIRC it turns out that the conda folder is huge, so caching it took longer than not caching it. And we don't want to cache the actual env (at least not always), because it's important to test that the environment can actually be created from scratch. (We fairly frequently find conda packaging bugs this way.)

Thinking about it now, one way to speed things up would be to cache the built conda env itself, and then, to make sure we keep testing env compatibility, have a nightly build that clears the cache and refreshes it with a freshly built env. My guess is that this would save 5-10 minutes on the conda jobs in actual PRs, which would probably be worth the added CI/YAML complexity.
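A minimal sketch of one way to implement that idea with actions/cache (the env path and environment.yml key are assumptions, and rolling the date into the key means the first job of each day rebuilds from scratch rather than a literal nightly job):

```yaml
# Sketch: reuse a previously built conda env between runs, but roll the
# cache key over daily so the environment still gets rebuilt from scratch
# regularly (which is how we catch conda packaging problems).
- name: Compute daily cache key suffix
  id: date
  shell: bash
  run: echo "today=$(date +%Y-%m-%d)" >> "$GITHUB_OUTPUT"
- uses: actions/cache@v3
  with:
    path: ~/miniconda3/envs/mne   # assumed location of the built env
    key: conda-env-${{ runner.os }}-${{ hashFiles('environment.yml') }}-${{ steps.date.outputs.today }}
```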

  4. Maybe we could have people start with draft PRs that do not trigger CI? Or just a reduced subset such as style and/or only the tests that are directly affected by the changes?

I think draft PRs trigger CIs the same way; I'm not sure if that can be changed or if we want to change it. Most often I see "draft" used to mean "don't need to review yet", while people still want CIs to run. To me this again is a best-practice / dev-process issue, where people should use our [skip ...] markers as much as possible to save CI time.
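If we ever did want drafts to trigger only a reduced subset, GitHub Actions can condition jobs on the draft flag; a sketch only, with a made-up job name (style checks could stay unconditional in a separate job):

```yaml
# Hypothetical: skip the expensive test job for draft PRs (lives under jobs:).
full-tests:
  runs-on: ubuntu-latest
  if: github.event_name != 'pull_request' || github.event.pull_request.draft == false
  steps:
    - uses: actions/checkout@v3
    # ... install dependencies and run the full test suite here
```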

@cbrnr (Contributor, Author) commented Jun 3, 2022

Thanks @larsoner! I do actually have repo-level notifications turned on, but it's still hard for me to follow every aspect of MNE-Python development (time is the limiting factor, so I skim the titles of PRs where I'm not explicitly mentioned, and if something doesn't immediately catch my attention I mark it as done).

Good to know that you have been taking care of our tests with regard to run time and efficiency. I agree that people should run tests locally first and only run CIs when they have finalized their PR. Of course this is not always possible, but it is a useful goal. However, I think that if we don't somehow enforce this and keep running all jobs on every PR, nothing will change. I have to admit that I don't know all of our markers (and what exactly they do) by heart (and I often forget to include them in the commit message), so maybe we could include this information in the PR template?

Another thing we could do to avoid wasting CI time on style errors is to run the flake8 check first and start the other jobs only if the style check succeeds.
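In GitHub Actions this kind of gating is expressed with needs; a minimal sketch with made-up job names, not our actual workflow:

```yaml
# Hypothetical layout: the cheap style job runs first, and the expensive
# test job only starts if it succeeds.
jobs:
  style:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
      - run: pip install flake8
      - run: make flake
  tests:
    needs: style   # not started at all if the style job fails
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # ... install dependencies and run pytest as usual
```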

The thing with looking for inefficient tests is that it's often not the ones with a long run time. Many shorter inefficient tests can also add up, but they are much harder to identify (unless you really dig deep). But of course, the longest jobs are the easiest targets.

Re redundant jobs, running the same tests for different Python versions is not redundant. What I meant is that because we're using so many different CI platforms, it might make sense to consolidate some jobs onto fewer platforms, so that all of the boilerplate setup would not happen as often. GitHub Actions has images for all three operating systems, which makes it possible to have one configuration file with a matrix of options. That could possibly lead to more efficient jobs, but not necessarily.

Finally, what I think is really the core issue is how people write tests, the actual content of our test modules. I can only speak for myself, but when I add a test I usually don't have time to go through everything else that's already available in the relevant test module. I skim the tests, but more often than not I'm just writing a new test case which includes the required setup (generating test data and so on). I am pretty sure that there is potential to consolidate the different test cases in many modules (e.g. just create some data once and then run the various tests). This is a huge effort, and I'm not sure anyone is willing to do it (although I think it would be important and has great potential to make our tests more efficient).

@drammock (Member) commented Jun 3, 2022

Another thing we could do to avoid wasting CI time on style errors is to run the flake8 check first and start the other jobs only if the style check succeeds.

I'd be OK with that

The thing with looking for inefficient tests is that it's often not the ones with a long run time. Many shorter inefficient tests can also add up, but they are much harder to identify (unless you really dig deep). But of course, the longest jobs are the easiest targets.

I had the same thought. Personally I'd like it if most of our tests were <10 lines of code, with very descriptive names, and with smart use of helper functions (to keep it DRY) and pytest fixtures (to avoid computing the same thing twice). I imagine / hope that would make it less necessary to "dig deep" when writing new tests.

I can only speak for myself, but when I add a test I usually don't have time to go through everything else that's already available in the relevant test module. I skim the tests, but more often than not I'm just writing a new test case which includes the required setup (generating test data and so on).

It's not just you.

I am pretty sure that there is potential to consolidate the different test cases in many modules (e.g. just create some data once and then run the various tests).

I'm agnostic as to whether a thorough scrubbing of our tests would make a small difference or a big difference. I used to be "pretty sure" too, but as time goes by I'm starting to suspect that it was wishful certainty based on impatience. 🤷🏻

In the end, someone needs to dedicate the time somehow. We could consider doing a code sprint that was restricted to only working on our test efficiency (and coverage); how much time could you commit to it if we did that?

@larsoner (Member) commented Jun 3, 2022

The thing with looking for inefficient tests is that it's often not the ones with a long run time. Many shorter inefficient tests can also add up, but they are much harder to identify (unless you really dig deep). But of course, the longest jobs are the easiest targets.

FWIW this concern is really what prompted the addition of the folder/module-level timing reporting, in addition to the built-in individual (slowest) test timing. It tells you what the best targets for time reduction are, regardless of the time per individual test. So that's a good starting point for where to reduce/consolidate.
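For anyone digging in, the built-in per-test timing mentioned above corresponds to pytest's --durations option (the folder/module-level summary is our own reporting on top of that); a CI step would look roughly like:

```yaml
# Print the 30 slowest individual tests at the end of the run.
- run: pytest mne --durations=30
```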

@cbrnr (Contributor, Author) commented Jun 7, 2022

I could commit some time if we did a sprint (and if I knew a couple of weeks in advance), but it would be great to find others to work on this too. One thing I could try to do is to make all tests depend on the successful completion of our style checks. This would probably require combining our separate configuration YAMLs into one file (https://docs.github.com/en/actions/using-workflows/about-workflows#creating-dependent-jobs).

@larsoner (Member) commented Jun 7, 2022

Now that GH-actions supports restarting just the failed builds, we could use a single file. The only downside is that we might sometimes have to wait a little before restarting jobs, because you have to wait for all of them to complete before you can retry, but I think that's okay. Azure is already that way and it's usually not too bad to wait.

The first step, I think, would be a no-op PR that combines the GH-actions workflows into a single file, which we then get green and merge. Then the next step could be contingent processing. FWIW we could also do contingent processing in Azure the same way, as that's already in a single file and already runs at least some style checks (though not all, IIRC). I'd be +1 for running all style checks in both places; they are very fast, so the redundancy is not too painful.

@larsoner (Member)

Closing since I don't see any immediate areas for improvement here at this point, but we can reopen if I missed something.
