Improve performance of CI system #1407
Related: #1203
Another idea is to improve PR speed: build ARM images only in master, or when the commit message contains some pre-defined string. Also, we might want to use actions/cache.
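A minimal sketch of what such a condition could look like in a workflow; the job name and the "[build-arm]" marker string are made up for illustration:

```yaml
jobs:
  build-arm64:
    runs-on: ubuntu-latest
    # Only build ARM images on master, or when the commit message opts in
    # via a pre-defined marker string ("[build-arm]" is just an example).
    if: github.ref == 'refs/heads/master' || contains(github.event.head_commit.message, '[build-arm]')
    steps:
      - uses: actions/checkout@v2
      - name: Build arm64 images (under emulation)
        run: echo "build arm64 images here"  # placeholder for the real build step
```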
Could you move the multiarch build into a separate GitHub workflow? You'd then get multiple CI statuses on PRs, and could choose to merge after the amd64 job passes instead of waiting for all jobs?
@manics I agree that is an important optimization. Note that we only need separate jobs, not separate workflows (a single workflow can contain multiple jobs). I suggest we both separate amd64 from arm64 (optimization 2) and separate images from each other (optimization 3).
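A rough sketch of what both splits could look like inside a single workflow; the job names and the make invocations are illustrative, not the repository's actual targets:

```yaml
jobs:
  base-notebook-amd64:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: make build/base-notebook  # hypothetical per-image target
  base-notebook-arm64:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: docker/setup-qemu-action@v1  # emulate arm64 on the amd64 runner
      - run: make build/base-notebook PLATFORM=linux/arm64  # hypothetical variable
  minimal-notebook-amd64:
    needs: base-notebook-amd64  # dependent images wait only for their base image job
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: make build/minimal-notebook
```

With a layout like this, a PR shows one status per job, so a PR could be merged once the amd64 jobs are green without waiting for the arm64 ones.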
I've ordered 7 RPi computers and plan to make them self-hosted arm64-based runners for us in the Jupyter ecosystem where needed.
Wow, nice! :) I also wanted to create some VMs on ARM to use as self-hosted runners, but if you're already on it, that's great 👍
@consideRatio did you have any luck with arm runners?
@mathbunnyru I have a k8s cluster running on 7 Raspberry Pi computers etc., but I've still failed to deploy the GitHub runner software on k8s. I'm left quite clueless about what is going on with that and have failed to debug it.
I see. Unfortunately, I have almost zero experience with k8s and absolutely zero experience with self-hosted runners, so I can't help you right now :(
@consideRatio I noticed that the build times in master are really slow. The latest master build is still running: the build step took 1h 16m 43s, and the push has already been going for almost an hour and is not yet finished.
Hmmm, looking into this a bit, my guess is that the cache for layers grows too large and that causes it to be discarded along the way, forcing a rebuild or similar. This guess is supported by noting that previous builds had successfully been using a cache for at least the base-notebook image, and that this suddenly stopped working when more images were added to the build in recent PRs. I have a few ideas on what we could do:
Practically, it can be done like this:

```yaml
# Without this our cache may get reset.
#
# NOTE: This step needs to run before actions/checkout to not end
# up with an empty workspace folder.
#
- name: Maximize build space
  uses: easimon/maximize-build-space@b4d02c14493a9653fe7af06cc89ca5298071c66e
  with:
    root-reserve-mb: 51200 # 50 GB
    build-mount-path: /var/lib/docker/tmp # remaining space
    remove-dotnet: "true"
    remove-haskell: "true"
    remove-android: "true"
```

To do 2, we would just experiment by removing
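On the earlier actions/cache idea: if the images are built through docker buildx, one option is the GitHub Actions cache backend exposed by docker/build-push-action. This is only a sketch under that assumption; the context path and tag are placeholders, and it needs a buildx version that supports the gha cache exporter:

```yaml
- uses: docker/setup-buildx-action@v1
- uses: docker/build-push-action@v2
  with:
    context: ./base-notebook  # placeholder path
    tags: jupyter/base-notebook:latest
    push: false
    # Store and reuse image layers in the GitHub Actions cache between runs.
    cache-from: type=gha
    cache-to: type=gha,mode=max
```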
Hey folks, I would like to help with optimising the CI 😉 Also, on the matter of
Wow, I didn't know about this project - I will submit a request in a few days, thank you! |
So, I can share my vision of how to make this work.
Note:
@mathbunnyru what does
Thanks, I meant
@manics what do you think about my proposal?
Quick note from mobile: native arm runners, I got them running on a k8s cluster set up with Raspberry Pi computers. The downside: you can't use the same actions etc. you have used in a GitHub workflow. setup-python etc. relies on cached versions in an amd64-maintained GitHub CI environment that won't work on arm64. So, going arm64 native means abandoning typical actions we have relied on. IMO, I'm more positive on having standalone arm64 builds that then get combined in a manifest augmentation step. Anyhow, on mobile, just wanted to warn about the arm64 runner challenges.
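For reference, a sketch of what a native arm64 job might look like while avoiding the toolcache-dependent actions; the labels depend on how the runner was registered, and the build step is a placeholder:

```yaml
jobs:
  build-arm64-native:
    # "self-hosted" plus whatever labels the arm64 runner was registered with.
    runs-on: [self-hosted, linux, ARM64]
    steps:
      - uses: actions/checkout@v2
      # actions/setup-python relies on a prebuilt amd64 toolcache, so install
      # an interpreter from the distribution instead.
      - run: |
          sudo apt-get update
          sudo apt-get install -y python3 python3-pip
      - run: echo "run the arm64 build here"  # placeholder
```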
Thanks, @consideRatio!
Proposal generally sounds good! In step 6, how are the manifests and tags calculated? Do they require both architectures to be built before calculation? If the calculations in step 6 can be run in parallel for the separate architectures, that avoids having to pull the images back down again: instead, you could push directly to Docker Hub and save the tags/manifests as JSON or text files, uploaded as GitHub build artifacts (one set of artifacts for each architecture). The main workflow could then fetch those artifacts, combine them as necessary, and update everything else without even touching the images.
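A sketch of the artifact handoff described above; the file names, artifact names, and the script that writes the JSON are all made up for illustration:

```yaml
# In each per-architecture build job:
- name: Save computed tags/manifests
  run: ./write-tags.sh --arch amd64 --output /tmp/tags-amd64.json  # hypothetical script
- uses: actions/upload-artifact@v2
  with:
    name: tags-amd64
    path: /tmp/tags-amd64.json

# In the main workflow, once both architectures have finished:
- uses: actions/download-artifact@v2
  with:
    name: tags-amd64
- uses: actions/download-artifact@v2
  with:
    name: tags-arm64
```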
I didn't want to push to Docker Hub directly, because we might end up in a situation where the x86 images are fine and already uploaded, but arm doesn't build for some reason. @manics I've updated my proposition to include your suggestions.
Yes, makes sense to me 😄
Hey folks, since my brain works in a very visual way, I went ahead and made a diagram which captures the proposed approach above:
As I mentioned in #1203, I would be happy to start working on a prototype to start parallelising stuff.
@trallard very nice! A few notes:
Please proceed. I won't have much time for a few months, but I'm ready to review and help if needed.
Also, I think this diagram will be very useful in the future if/when we implement this, so it might be worth adding a separate page in
For completeness, I have updated the diagram to reflect @mathbunnyru's comment above.
One more small update - we probably want to "Push tags and manifests" in the main workflow. |
So it's a step right before "Can merge tags?"
I think what we can do for now is to not create multi-platform images, and to make aarch64 tags look like this. Right now, every update is a pain: I have to rebuild 5 times to get to the point where the images on DockerHub are the same as if I had built them from source.
A simple-minded comment I posted under an unrelated issue, together with the response from @mathbunnyru:
This is a good suggestion. But we still have to work with amd64/aarch64 differences (for example, we're not building everything under aarch64).
We have crippled our CI system's performance after introducing support for arm64-based images. A key reason for this is that emulating arm64 images on the amd64-based runners GitHub provides is far slower than building natively, besides the fact that we now end up building base-notebook and minimal-notebook for arm64 in sequence alongside the other images.
I'm not fully sure how we should optimize this in the long run, but let's work under the assumption that we will have high-performance self-hosted arm64-based GitHub Actions runners that can work in parallel with the amd64 runners. Below is an overview of a very optimized system, where several parts can be done separately.
1. Nightly builds
We have nightly builds with :nightly-amd64 and :nightly-arm64 tags.
2. amd64 / arm64 in parallel
All tests for amd64 and arm64 run in parallel, relying on the nightly-amd64 and nightly-arm64 caches.
3. Images in parallel where possible
All tests for individual images are run in a dedicated job that needs its base image's job to complete. Some images can run in parallel:
4. Avoid rebuilds when merging
Tests finish by updating a GitHub container registry associated with the PR. By doing so, our publishing job on merge to master can opt to use the images as they were built during tests, if they are considered fresh enough.
5. Parallel manifest creation
Merge to the default branch triggers manifest creation jobs on both amd64 and arm64. If we opt not to optimize using step 4, then this could also build fresh images using the nightly cache first.
6. Combine manifests into one before pushing to the official registry
Merge to the default branch triggers a job that pulls both the amd64 image and the arm64 image and defines a combined Docker manifest, which is then pushed to our official container registry. I think this could be done with something like docker manifest create <name of combined image> <amd64 only image> <arm64 only image>, but @manics knows more and I lack experience with this.
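To make item 6 a bit more concrete, here is a sketch of the combining job; the image names and per-architecture tags are placeholders, the job assumes it has already logged in to the registry, and docker manifest may require the experimental CLI to be enabled depending on the Docker version on the runner:

```yaml
jobs:
  merge-manifests:
    runs-on: ubuntu-latest
    steps:
      - name: Create and push a combined manifest
        run: |
          # Placeholder tags: one image built per architecture, combined under one name.
          docker manifest create jupyter/base-notebook:latest \
            jupyter/base-notebook:latest-amd64 \
            jupyter/base-notebook:latest-arm64
          docker manifest push jupyter/base-notebook:latest
```

Item 1 would then mostly amount to running the existing build on an on: schedule (cron) trigger and pushing the :nightly-amd64 / :nightly-arm64 tags.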
Standalone performance issue
This standalone issue will go away by using better strategies like the ones above, and it isn't so critical to fix either, I'd say. But currently, we build minimal-notebook again without using the cache during push-multi, assuming push-multi for base-notebook has already run. I think it is because we re-tag jupyter/base-notebook:latest.