Download build artifacts from the backport branch for testing in the main branch #357


Merged
merged 5 commits into NVIDIA:main from the cache_artifacts branch on Jan 9, 2025

Conversation

leofang
Member

@leofang leofang commented Jan 6, 2025

Close #329.

Update: Please see #357 (comment).

Refresher: there are two kinds of caches we can use in GHA: Cache and Artifacts. We've been using Artifacts to store build artifacts, which has worked fine so far, but the main issue is that artifacts are scoped on a per-PR basis, meaning they cannot be reused across CI workflow runs triggered by different PRs.

This PR adds the capability of uploading artifacts to the Cache space when a PR is merged into the main branch, so that they can serve as a fallback when a workflow needs certain artifacts for whatever reason. Note that while the Cache space is limited to 10 GB per repo, for our purposes (we have small wheels) it is still OK as a stop-gap solution until our DevOps team finds a more sustainable one.
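
For reference, a minimal sketch of what such a Cache-based fallback could look like as workflow steps (the step names, paths, and key scheme here are illustrative assumptions, not the exact steps in this PR, and a later comment below moves away from this approach entirely):

```yaml
# Illustrative sketch only; names, paths, and the key scheme are assumptions.
- name: Save built wheels to the GHA Cache (only on merge to main)
  if: github.event_name == 'push' && github.ref_name == 'main'
  uses: actions/cache/save@v4
  with:
    path: ./dist/*.whl
    # One entry per commit; consumers fall back to the newest entry below.
    key: build-artifacts-${{ github.sha }}

- name: Restore wheels from the GHA Cache as a fallback
  uses: actions/cache/restore@v4
  with:
    path: ./dist/*.whl
    key: build-artifacts-${{ github.sha }}
    # Prefix match restores the most recently saved entry from main.
    restore-keys: |
      build-artifacts-
```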

I also cleaned up the shell choice a bit so that all job steps use the same setting.

@leofang leofang added the P0 (High priority - Must do!), CI/CD (CI/CD infrastructure), feature (New feature or request), and to-be-backported (Trigger the bot to raise a backport PR upon merge) labels on Jan 6, 2025
@leofang leofang self-assigned this Jan 6, 2025
Contributor

copy-pr-bot bot commented Jan 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang leofang force-pushed the cache_artifacts branch 2 times, most recently from c05b589 to f6ec8f2 on January 6, 2025 16:56
@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang leofang force-pushed the cache_artifacts branch 2 times, most recently from f527b33 to d34a5c2 on January 6, 2025 17:14
@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang leofang force-pushed the cache_artifacts branch 2 times, most recently from ae72bed to 70651ea on January 6, 2025 18:39
@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang leofang force-pushed the cache_artifacts branch 2 times, most recently from 1cb91fe to 3abdab5 on January 6, 2025 19:21
@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 9, 2025

/ok to test

@jakirkham
Collaborator

Note that while the Cache space is limited to 10 GB per repo, for our purposes (we have small wheels) it is still OK as a stop-gap solution until our DevOps team finds a more sustainable one.

I think there is a way to set up a retention policy in the repo settings or the workflow, so this might be an option to help manage size.

I should add that without this setting, artifacts stick around forever, and removing them in the UI is one at a time. So that is something to be aware of.
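
For reference, the retention period can be set either in the repository settings (under Actions) or per upload step. A minimal, hypothetical sketch of the per-step form, with placeholder artifact name, path, and day count:

```yaml
- name: Upload wheels
  uses: actions/upload-artifact@v4
  with:
    name: cuda-bindings-wheel   # placeholder artifact name
    path: ./dist/*.whl          # placeholder path
    # Overrides the repository default so old artifacts are pruned automatically.
    retention-days: 14
```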

@leofang
Member Author

leofang commented Jan 9, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 9, 2025

Thanks, @jakirkham! None of the GHA Cache-related comments are relevant anymore since I am moving away from it (#357 (comment)). I'll update the PR title/description once the CI is green.

@leofang
Member Author

leofang commented Jan 9, 2025

/ok to test

@leofang leofang changed the title Upload build artifacts to GHA Cache when merged to main Download build artifacts generated from the backport branch for testing in the main branch Jan 9, 2025
@leofang leofang changed the title Download build artifacts generated from the backport branch for testing in the main branch Download build artifacts from the backport branch for testing in the main branch Jan 9, 2025
@leofang
Member Author

leofang commented Jan 9, 2025

OK, commit 75e37bd is basically a rewrite of the whole PR. The result is indeed a lot cleaner, as expected. Since we have the capability of fetching artifacts generated from the backport branch (11.8.x), there is no need to introduce any concept such as a global cache as discussed earlier, and therefore we can avoid using the GHA Cache, which would add some complexity.

The new logic is simply:

  • For testing against CTK 12.x:
    • Run tests of cuda.core and cuda.bindings against cuda.bindings built from the main branch
  • For testing against CTK 11.x:
    • Run tests of cuda.core against the most recent successful build of cuda.bindings from the 11.8.x branch (a sketch of how such a fetch could be wired up is included below)
    • Tests of cuda.bindings themselves are run in the 11.8.x branch when we backport a PR
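
A minimal sketch of how that 11.8.x fallback fetch could be wired up (the workflow file name ci.yml, the artifact name, and the paths below are illustrative assumptions, not the exact values used in this PR):

```yaml
# Illustrative sketch only; names and paths are assumptions.
- name: Locate the latest successful CI run on the 11.8.x branch
  id: backport_run
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    run_id=$(gh run list --repo "${{ github.repository }}" \
               --branch 11.8.x --workflow ci.yml --status success \
               --limit 1 --json databaseId --jq '.[0].databaseId')
    echo "run_id=${run_id}" >> "$GITHUB_OUTPUT"

- name: Download the cuda.bindings wheels built on 11.8.x
  uses: actions/download-artifact@v4
  with:
    name: cuda-bindings-wheel             # placeholder artifact name
    path: ./dist
    run-id: ${{ steps.backport_run.outputs.run_id }}
    github-token: ${{ github.token }}     # needed when downloading from another run
```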

@ksimpson-work @vzhurba01 this is ready for review.

@leofang leofang marked this pull request as ready for review January 9, 2025 03:21
@leofang leofang requested a review from ksimpson-work January 9, 2025 03:24
Comment on lines -326 to -328
if [[ ${{ matrix.python-version }} == "3.13" ]]; then
# TODO: remove this hack once cuda-python has a cp313 build
if [[ $SKIP_CUDA_BINDINGS_TEST == 1 ]]; then
Member Author

Note: this hack can be removed now that we generate Python 3.13 wheels for cuda.bindings 11.8 and can retrieve them in the CI; we do not need them published on PyPI in order to use them!

@ksimpson-work
Contributor

Correct me if I'm wrong. I want to understand this well.

I see two cases:

One is where you are backporting something to 11.8.x and want to test cuda.core against the active 11.8.x CI build, in which case you would want to target the latest 11.8.x if it was successfully built, or bail out if there were build errors.

The second case is making a change in main, specifically to cuda.core, in which case you would want to test against not the latest successful CI run on 11.8.x but the top of the 11.8.x tree. This is because if someone were simultaneously testing an 11.8.x change, you might test against an 11.8.x version that is different from what a user would be installing.

From my understanding of this change, there's a race condition between (11.8.x CI workflows + merges) and main workflows.

WDYT?

@leofang
Member Author

leofang commented Jan 9, 2025

The race condition is a legit concern, but it is still better than the status quo (no integration test against the head of the backport branch). Moreover, we will set up a nightly CI to reduce the risk (#294), and we already have pre-release QA as the final line of defense, so I think it is not very risky and can be improved once our DevOps team takes over and iterates toward a more robust implementation.

In the first case, if a backport is relevant for cuda.core to work, cuda.core tests would fail unless the backport is merged and rebuilt. So we will know what's going on without a silent green light.

The second case is where the race condition could happen IIUC ("which 11.8 build am I testing against?").

@ksimpson-work
Contributor

OK, I understand that it is a catch-22, and I agree that testing against the latest successful build is far more robust than not testing at all. I just wanted to verbalize that to ensure I correctly understood it, and to make sure we understood that there is a possible improvement there for the DevOps team to address in the future. LGTM

@leofang
Member Author

leofang commented Jan 9, 2025

to ensure I correctly understood it, and to make sure we understood that there is a possible improvement there for the DevOps team to address in the future.

Yes, all great questions here! You made me think twice (and long enough to seek an alternative solution). This is why we need code review 😄 Thanks, Keenan!

@leofang leofang merged commit 4cbab16 into NVIDIA:main Jan 9, 2025
47 checks passed
@leofang leofang deleted the cache_artifacts branch January 9, 2025 21:36
Labels
CI/CD (CI/CD infrastructure), feature (New feature or request), P0 (High priority - Must do!)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Cache artifacts generated on the main & 11.8.x branches
3 participants