Download build artifacts from the backport branch for testing in the main branch #357


Merged
merged 5 commits into NVIDIA:main from the cache_artifacts branch on Jan 9, 2025

Conversation

leofang
Member

@leofang leofang commented Jan 6, 2025

Close #329.

Update: Please see #357 (comment).

Refresher: there are two kinds of caches we can use in GHA: Cache and Artifacts. We've been using Artifacts to store build artifacts, which has worked fine so far, but the main issue is that artifacts are scoped on a per-PR basis, meaning they cannot be reused across CI workflow runs triggered by different PRs.

This PR adds the capability of uploading artifacts to the Cache space when a PR is merged into the main branch, so that they can serve as a fallback when a workflow needs certain artifacts for whatever reason. Note that while the Cache space is limited to 10 GB per repo, for our purposes (we have small wheels) it is still OK as a stop-gap solution until our DevOps team finds a more sustainable one.
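
For reference, a minimal sketch of what such a Cache-based fallback could look like as workflow steps (the step names, paths, and key scheme here are illustrative assumptions, not the exact steps in this PR, and a later comment below moves away from this approach entirely):

```yaml
# Illustrative sketch only; names, paths, and the key scheme are assumptions.
- name: Save built wheels to the GHA Cache (only on merge to main)
  if: github.event_name == 'push' && github.ref_name == 'main'
  uses: actions/cache/save@v4
  with:
    path: ./dist/*.whl
    # One entry per commit; consumers fall back to the newest entry below.
    key: build-artifacts-${{ github.sha }}

- name: Restore wheels from the GHA Cache as a fallback
  uses: actions/cache/restore@v4
  with:
    path: ./dist/*.whl
    key: build-artifacts-${{ github.sha }}
    # Prefix match restores the most recently saved entry from main.
    restore-keys: |
      build-artifacts-
```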

I also cleaned up the shell choice a bit so that all job steps use the same setting.

@leofang leofang added the P0 (High priority - Must do!), CI/CD (CI/CD infrastructure), feature (New feature or request), and to-be-backported (Trigger the bot to raise a backport PR upon merge) labels on Jan 6, 2025
@leofang leofang self-assigned this Jan 6, 2025
Contributor

copy-pr-bot bot commented Jan 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang leofang force-pushed the cache_artifacts branch 2 times, most recently from c05b589 to f6ec8f2 on January 6, 2025 16:56
@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang leofang force-pushed the cache_artifacts branch 2 times, most recently from f527b33 to d34a5c2 on January 6, 2025 17:14
@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang leofang force-pushed the cache_artifacts branch 2 times, most recently from ae72bed to 70651ea on January 6, 2025 18:39
@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang leofang force-pushed the cache_artifacts branch 2 times, most recently from 1cb91fe to 3abdab5 on January 6, 2025 19:21
@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 6, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 9, 2025

/ok to test

@jakirkham
Collaborator

Note that while the Cache space is limited to 10 GB per repo, for our purposes (we have small wheels) it is still OK as a stop-gap solution until our DevOps team finds a more sustainable one.

I think there is a way to set up a retention policy in the repo settings or the workflow, so this might be an option to help manage size.

I should add that without this setting, artifacts stick around forever, and removing them in the UI is one at a time. So that is something to be aware of.
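
For reference, the retention period can be set either in the repository settings (under Actions) or per upload step. A minimal, hypothetical sketch of the per-step form, with placeholder artifact name, path, and day count:

```yaml
- name: Upload wheels
  uses: actions/upload-artifact@v4
  with:
    name: cuda-bindings-wheel   # placeholder artifact name
    path: ./dist/*.whl          # placeholder path
    # Overrides the repository default so old artifacts are pruned automatically.
    retention-days: 14
```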

@leofang
Member Author

leofang commented Jan 9, 2025

/ok to test

@leofang
Member Author

leofang commented Jan 9, 2025

Thanks, @jakirkham! None of the GHA Cache-related comments are relevant anymore since I am moving away from it (#357 (comment)). I'll update the PR title/description once the CI is green.

@leofang
Member Author

leofang commented Jan 9, 2025

/ok to test

@leofang leofang changed the title Upload build artifacts to GHA Cache when merged to main Download build artifacts generated from the backport branch for testing in the main branch Jan 9, 2025
@leofang leofang changed the title Download build artifacts generated from the backport branch for testing in the main branch Download build artifacts from the backport branch for testing in the main branch Jan 9, 2025
@leofang
Member Author

leofang commented Jan 9, 2025

OK, commit 75e37bd is basically a rewrite of the whole PR. The result is indeed a lot cleaner, as expected. Since we have the capability of fetching artifacts generated from the backport branch (11.8.x), there is no need to introduce any concept such as a global cache as discussed earlier, and therefore we can avoid using the GHA Cache, which would add some complexity.

The new logic is simply:

  • For testing against CTK 12.x:
    • Run tests of cuda.core and cuda.bindings against cuda.bindings built from the main branch
  • For testing against CTK 11.x:
    • Run tests of cuda.core against the most recent successful build of cuda.bindings from the 11.8.x branch (a sketch of how such a fetch could be wired up is included below)
    • Tests of cuda.bindings themselves are run in the 11.8.x branch when we backport a PR
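
A minimal sketch of how that 11.8.x fallback fetch could be wired up (the workflow file name ci.yml, the artifact name, and the paths below are illustrative assumptions, not the exact values used in this PR):

```yaml
# Illustrative sketch only; names and paths are assumptions.
- name: Locate the latest successful CI run on the 11.8.x branch
  id: backport_run
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    run_id=$(gh run list --repo "${{ github.repository }}" \
               --branch 11.8.x --workflow ci.yml --status success \
               --limit 1 --json databaseId --jq '.[0].databaseId')
    echo "run_id=${run_id}" >> "$GITHUB_OUTPUT"

- name: Download the cuda.bindings wheels built on 11.8.x
  uses: actions/download-artifact@v4
  with:
    name: cuda-bindings-wheel             # placeholder artifact name
    path: ./dist
    run-id: ${{ steps.backport_run.outputs.run_id }}
    github-token: ${{ github.token }}     # needed when downloading from another run
```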

@ksimpson-work @vzhurba01 this is ready for review.

@leofang leofang marked this pull request as ready for review January 9, 2025 03:21
@leofang leofang requested a review from ksimpson-work January 9, 2025 03:24
Comment on lines -326 to -328
if [[ ${{ matrix.python-version }} == "3.13" ]]; then
# TODO: remove this hack once cuda-python has a cp313 build
if [[ $SKIP_CUDA_BINDINGS_TEST == 1 ]]; then
Member Author

Note: this hack can be removed now that we generate Python 3.13 wheels for cuda.bindings 11.8 and can retrieve them in the CI; we do not need them published on PyPI in order to use them!

@ksimpson-work
Contributor

Correct me if I'm wrong. I want to understand this well.

I see two cases:

One is where you are backporting something to 11.8.x and want to test cuda.core against the active 11.8.x CI build, in which case you would want to target the latest 11.8.x if it was successfully built, or bail out if there were build errors.

The second case is making a change in main, specifically to cuda.core, in which case you would want to test against not the latest successful CI run on 11.8.x but the top of the 11.8.x tree. This is because if someone were simultaneously testing an 11.8.x change, you might test against an 11.8.x version that is different from what a user would be installing.

From my understanding of this change, there's a race condition between (11.8.x CI workflows + merges) and main workflows.

WDYT?

@leofang
Member Author

leofang commented Jan 9, 2025

The race condition is a legit concern, but it is still better than the status quo (no integration test against the head of the backport branch). Moreover, we will set up a nightly CI to reduce the risk (#294), and we already have pre-release QA as the final line of defense, so I think it is not very risky and can be improved once our DevOps team takes over and iterates toward a more robust implementation.

In the first case, if a backport is relevant for cuda.core to work, cuda.core tests would fail unless the backport is merged and rebuilt. So we will know what's going on without a silent green light.

The second case is where the race condition could happen IIUC ("which 11.8 build am I testing against?").

@ksimpson-work
Contributor

OK, I understand that it is a catch-22, and I agree that testing against the latest successful build is far more robust than not testing at all. I just wanted to verbalize that to ensure I correctly understood it, and to make sure we understood that there is a possible improvement there for the DevOps team to address in the future. LGTM

@leofang
Member Author

leofang commented Jan 9, 2025

to ensure I correctly understood it, and to make sure we understood that there is a possible improvement there for the DevOps team to address in the future.

Yes, all great questions here! You made me think twice (and long enough to seek an alternative solution). This is why we need code review 😄 Thanks, Keenan!

@leofang leofang merged commit 4cbab16 into NVIDIA:main Jan 9, 2025
47 checks passed
@leofang leofang deleted the cache_artifacts branch January 9, 2025 21:36
Labels
CI/CD (CI/CD infrastructure), feature (New feature or request), P0 (High priority - Must do!)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Cache artifacts generated on the main & 11.8.x branches
3 participants