Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve caching strategy across the board of CI workflow #45289

Merged
merged 1 commit into from
Jan 1, 2025

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Dec 30, 2024

We are using various caches in our build and so far - due to the way how "standard" caching works, PRs from forks could not effectively use the cache from main Airflow repository - because caches are not shared with other repositories - so the PRs builds could only use cache effectively when they were rebased and continued running from the same fork.

This PR improves caching strategy using "stash" action from the ASF. Unlike cache - the action uses artifacts to store cache, and that makes it possible for the stash action to use such cache uploaded from main canary builds in PRs coming from the fork.

As part of this change all the places where setup-python was used and breeze installed afterwards were reviewed and updated to use only breeze installation action (it already installs python) and this action has been improved to use UV caching effectively.

Overall this PR should decrease setup overhead for many jobs across the CI workflow.

Follow-up after #45266


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk
Copy link
Member Author

potiuk commented Dec 30, 2024

Looks like the stash action does not work exactly as advertised :)

@gopidesupavan
Copy link
Member

Looks like the stash action does not work exactly as advertised :)

think we need to set env variable name head_name https://github.com/apache/infrastructure-actions/blob/main/stash/restore/action.yml#L108

@gopidesupavan
Copy link
Member

Looks like the stash action does not work exactly as advertised :)

think we need to set env variable name head_name https://github.com/apache/infrastructure-actions/blob/main/stash/restore/action.yml#L108

ah head_name is derived from github ref.

@potiuk
Copy link
Member Author

potiuk commented Dec 30, 2024

Yeah - it seems that the action likely gets into a race condition, or does not work as advertised:

From: https://github.com/apache/infrastructure-actions/tree/main/stash#usage

Using the save action again in the same workflow run will overwrite the existing cache with the same key. This does apply to each invocation in a matrix job as well! If you want to keep the old cache, you can use a different key or set overwrite to false.

It seems that we have error 409 conflict not handled (I tried with both override 'false' - there it fails obviously) and default true - and there it should work - but I think there is a race condition that when multiple jobs are trying to upload the same artifact - which is happening in our case - they get 409 conflict:

Root directory input is valid!
Error: Failed to CreateArtifact: Received non-retryable error: Failed request: (409) Conflict: an artifact with this name already exists on the workflow run

CC: @assignUser - is the right guess ?

@potiuk
Copy link
Member Author

potiuk commented Dec 30, 2024

Evidently it tries with "overwite: true":

image

@potiuk
Copy link
Member Author

potiuk commented Dec 30, 2024

If that's the right guess, then we might try to handle it somehow - since we don't care which uploaded artifact will be uploaded - we can actually even ignore 409 when it happens - without retrying it because it means that someone else managed to upload the artifact in parallel and they "won".

@potiuk
Copy link
Member Author

potiuk commented Dec 30, 2024

cc: @assignUser - seems we are stressing your action to the limit :)

@assignUser
Copy link
Member

Wow that's wild, haven't seen that before :D But that's an issue with the artifact backend, nothing really I can change in the action itself.

@potiuk
Copy link
Member Author

potiuk commented Dec 30, 2024

Wow that's wild, haven't seen that before :D But that's an issue with the artifact backend, nothing really I can change in the action itself.

Let's see... PRs might be coming :)

@potiuk
Copy link
Member Author

potiuk commented Dec 30, 2024

FYI. @assignUser and @gopidesupavan -> seems that this is a well known "feature" of the @v4 action that caused a number of users a problems when migrating to @v4. The proposed solution (which IMHO is just a workaround) is to upload each artifact with a different key and use "merge-multiple" feature and glob pattern to download and merge all such uploaded artifacts (!??!!#$@#%!%!$%!#%!)....

This is even explained here: https://github.com/actions/download-artifact/blob/main/docs/MIGRATION.md#multiple-uploads-to-the-same-named-artifact as solution.

actions/upload-artifact#478

Example solution: actions/upload-artifact#478 (comment)

But stash does not use the download-artitact action and "merge-multiple" ... And I really do not like the "solution". .. So we wil have to come up with a different approach.

@potiuk
Copy link
Member Author

potiuk commented Dec 30, 2024

OK. @assignUser and @gopidesupavan -> I think I found a solution (and actually this is a better one in general for performance, but slightly more "distributed" among the .yml files.

Instead of heaving a clear save/restore around installation, I only do:

  • restore
  • installl

And I make sure to have one separate job that is prerequisite of all other jobs (build-info which was already there as a single job that kicks-off the rest - for uv cache and install-pre-commit that is added as "needs" for all jobs that need the cache and those "needs" job are the only ones that upload the artifact.

This way we get a little longer bootstrap, but then all the other jobs should use the cache uploaded by the prerequisite job. Plus the bootstrap will also use the artifact from previous runs (or target branch) if corresponding pyproject.toml / pre-commit config files did not change.

@potiuk potiuk marked this pull request as ready for review December 31, 2024 15:55
@potiuk
Copy link
Member Author

potiuk commented Dec 31, 2024

OK. I think i Implemented all workarounds to not have to wait for any of the stash action PRs I created, let's see what will be the savings now

Copy link
Member

@gopidesupavan gopidesupavan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice these workarounds LGTM :)

@potiuk
Copy link
Member Author

potiuk commented Dec 31, 2024

There is one more error with tar ... Pre-commit is currently stored with /hom/runner prefix, that's why it had no impact on the timing.

But K8S is already hugely impactful -> instaling the venv + all the tools is ~ 1 minute. Savingit is 30 seconds, restoring 10 seconds - since we have many k8S jobs, this will be huge improvement

@potiuk potiuk force-pushed the improve-caching branch 6 times, most recently from 10f19da to ed286c5 Compare December 31, 2024 19:48
@potiuk
Copy link
Member Author

potiuk commented Dec 31, 2024

Nice. 1m35 s -> 30s for pre-commit environment.

@potiuk potiuk force-pushed the improve-caching branch 12 times, most recently from bd210d4 to 1aa9d07 Compare January 1, 2025 00:34
We are using various caches in our build and so far - due to the
way how "standard" caching works, PRs from forks could not effectively
use the cache from main Airflow repository - because caches are not
shared with other repositories - so the PRs builds could only
use cache effectively when they were rebased and continued running from
the same fork.

This PR improves caching strategy using "stash" action from the ASF.
Unlike `cache` - the action uses artifacts to store cache, and that
makes it possible for the stash action to use such cache uploaded from
`main` canary builds in PRs coming from the fork.

As part of this change all the places where setup-python was used
and breeze installed afterwards were reviewed and updated to use
only breeze installation action (it already installs python) and this
action has been improved to use UV caching effectively.

Overall this PR should decrease setup overhead for many jobs across
the CI workflow.

Follow-up after apache#45266
@potiuk potiuk merged commit 1bcabb8 into apache:main Jan 1, 2025
86 checks passed
@potiuk potiuk deleted the improve-caching branch January 1, 2025 01:29
LefterisXefteris pushed a commit to LefterisXefteris/airflow that referenced this pull request Jan 5, 2025
We are using various caches in our build and so far - due to the
way how "standard" caching works, PRs from forks could not effectively
use the cache from main Airflow repository - because caches are not
shared with other repositories - so the PRs builds could only
use cache effectively when they were rebased and continued running from
the same fork.

This PR improves caching strategy using "stash" action from the ASF.
Unlike `cache` - the action uses artifacts to store cache, and that
makes it possible for the stash action to use such cache uploaded from
`main` canary builds in PRs coming from the fork.

As part of this change all the places where setup-python was used
and breeze installed afterwards were reviewed and updated to use
only breeze installation action (it already installs python) and this
action has been improved to use UV caching effectively.

Overall this PR should decrease setup overhead for many jobs across
the CI workflow.

Follow-up after apache#45266
agupta01 pushed a commit to agupta01/airflow that referenced this pull request Jan 6, 2025
We are using various caches in our build and so far - due to the
way how "standard" caching works, PRs from forks could not effectively
use the cache from main Airflow repository - because caches are not
shared with other repositories - so the PRs builds could only
use cache effectively when they were rebased and continued running from
the same fork.

This PR improves caching strategy using "stash" action from the ASF.
Unlike `cache` - the action uses artifacts to store cache, and that
makes it possible for the stash action to use such cache uploaded from
`main` canary builds in PRs coming from the fork.

As part of this change all the places where setup-python was used
and breeze installed afterwards were reviewed and updated to use
only breeze installation action (it already installs python) and this
action has been improved to use UV caching effectively.

Overall this PR should decrease setup overhead for many jobs across
the CI workflow.

Follow-up after apache#45266
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants