@jameslamb (Member) commented Sep 22, 2025

Description

Looking closely at the CI dependency graphs while working on #122, I noticed some opportunities to improve the end-to-end time for PR CI and nightly builds.

wheel-build-{libB} jobs only need to wait for wheel-build-{libA} jobs to complete if building libB wheels requires libA. This updates job dependencies here to follow that rule.

  • wheel-build-cuopt-server: start building immediately
  • wheel-build-cuopt-sh-client: start building immediately
  • build-images: adds wheel-build-cuopt-sh-client to the jobs this needs to wait for (branch / nightly builds only)
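
To illustrate why loosening job dependencies shortens end-to-end time, here is a minimal sketch that models a workflow as a dependency graph and computes when the last job finishes. Only `wheel-build-cuopt-server` and `wheel-build-cuopt-sh-client` are job names taken from this PR; the other job names, the "before" dependency edges, and all durations are assumptions made up for illustration. It also assumes unlimited runners.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def end_to_end_minutes(durations, needs):
    """Finish time of the slowest job, assuming every job starts as soon as its needs are done."""
    finish = {}
    # static_order() yields a job only after every job it needs.
    for job in TopologicalSorter(needs).static_order():
        start = max((finish[dep] for dep in needs.get(job, ())), default=0)
        finish[job] = start + durations[job]
    return max(finish.values())


# Hypothetical durations in minutes (not measured from real CI runs).
durations = {
    "wheel-build-libcuopt": 60,        # hypothetical job name
    "wheel-build-cuopt": 30,           # hypothetical job name
    "wheel-build-cuopt-server": 5,
    "wheel-build-cuopt-sh-client": 5,
}

# Hypothetical "before" graph: pure-Python wheels wait on builds they do not need.
before = {
    "wheel-build-cuopt": {"wheel-build-libcuopt"},
    "wheel-build-cuopt-server": {"wheel-build-cuopt"},
    "wheel-build-cuopt-sh-client": {"wheel-build-cuopt"},
}

# "After" graph: pure-Python wheels start building immediately.
after = {
    "wheel-build-cuopt": {"wheel-build-libcuopt"},
    "wheel-build-cuopt-server": set(),
    "wheel-build-cuopt-sh-client": set(),
}

before_minutes = end_to_end_minutes(durations, before)
after_minutes = end_to_end_minutes(durations, after)
print(before_minutes, after_minutes)
```

With these made-up numbers the pure-Python wheels move off the critical path, so anything that only needs them (for example their tests) could also start much earlier.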

This PR also removes some unnecessary builds. This project is currently running 16 wheel-build-cuopt-server jobs (2 CUDA major versions x 2 CPU architectures x 4 Python versions) when it only needs 2 (1 per CUDA major version), because it publishes py3-none-any wheels (https://pypi.anaconda.org/rapidsai-wheels-nightly/simple/cuopt-server-cu13).

  • removes duplicate cuopt-server builds
  • fixes artifact downloads of cuopt-sh-client in nightly tests
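
The job-count arithmetic for the cuopt-server builds can be checked with a short sketch. The specific CUDA, architecture, and Python values below are assumptions chosen only to match the counts stated in the description (2 x 2 x 4); the real matrix lives in the workflow configuration.

```python
from itertools import product

# Assumed matrix values, for illustration only.
cuda_majors = ["12", "13"]
cpu_archs = ["x86_64", "aarch64"]
python_versions = ["3.10", "3.11", "3.12", "3.13"]

# One job per combination when the wheel is treated as platform-specific.
full_matrix = list(product(cuda_majors, cpu_archs, python_versions))

# A py3-none-any wheel does not depend on CPU architecture or Python minor
# version, so one build per CUDA major version is enough.
pure_wheel_matrix = [(cuda,) for cuda in cuda_majors]

jobs_saved = len(full_matrix) - len(pure_wheel_matrix)
print(len(full_matrix), len(pure_wheel_matrix), jobs_saved)  # prints "16 2 14"
```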

Issue

Noticed while working towards #122

Notes for Reviewers

Benefits of these changes

Reduced end-to-end time for PR builds and probably for branch/nightly builds.

Reduced risk of failed or incorrect container-image builds resulting from build-images starting before there are cuopt-sh-client packages available.

14 fewer wheel-build-cuopt-server jobs per PR 😁

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

@jameslamb added the non-breaking (Introduces a non-breaking change) and improvement (Improves an existing functionality) labels Sep 22, 2025
@copy-pr-bot

This comment was marked as resolved.

package-name: cuopt_sh_client
package-type: python
append-cuda-suffix: false
pure-wheel: true

@jameslamb (Member, Author) commented on these lines:

This should fix the failures in the nightly wheel tests:

[rapids-github-run-id] Querying the GitHub API to determine relevant run of 'build.yaml'.
Downloading and decompressing cuopt_wheel_python_cuopt_sh_client from Run ID 17905116634 into /tmp/tmp.hYHfRQWjjq
no artifact matches any of the names or patterns provided

RAPIDS logger » [09/22/25 05:08:14]
┌──────────────────────────────────────────────────────────────────────────────────┐
|    rapids-retry: retry 1 of 3 | exit code: (1) -> sleeping for 120 seconds...    |

(build link)

@jameslamb changed the title from "WIP: build pure-Python wheels without waiting for dependencies" to "build pure-Python wheels without waiting for dependencies" Sep 22, 2025
@jameslamb marked this pull request as ready for review September 23, 2025 01:07
@jameslamb requested a review from a team as a code owner September 23, 2025 01:07
@jameslamb requested a review from bdice September 23, 2025 01:07
@jameslamb (Member, Author) commented:

Most conda-based Python tests were failing with errors like this:

/opt/conda/envs/test/lib/python3.13/site-packages/raft_dask/common/comms.py:28: in
from .comms_utils import (
E ImportError: /opt/conda/envs/test/lib/python3.13/site-packages/raft_dask/common/comms_utils.cpython-313-x86_64-linux-gnu.so: undefined symbol: ZN4ucxx8Endpoint7tagSendEPvmNS_3TagEbSt8functionIFv12ucs_status_tSt10shared_ptrIvEEES6

(conda-python-tests link)

Saw that across all CUDA versions, Python versions, and CPU architectures.

I suspect that the root cause might be rapidsai/ucxx#516, and that it'll be fixed once we get new raft / raft-dask packages. That ucxx PR was merged 5 hours ago and the latest build of raft packages was 6 hours ago (https://github.com/rapidsai/raft/actions/workflows/build.yaml), so the current raft packages were built before the ucxx change.

I triggered a new build of raft to try to help with it: https://github.com/rapidsai/raft/actions/runs/17933069800

@jameslamb (Member, Author) commented:

/merge

@rapids-bot (bot) merged commit 0ac70df into NVIDIA:branch-25.10 Sep 23, 2025
116 of 124 checks passed
@jameslamb deleted the more-build-parallelism branch September 23, 2025 15:55
rapids-bot bot pushed a commit that referenced this pull request Sep 24, 2025
Replaces #359 (my more-complicated earlier attempt at this)

This project runs nightly builds and tests on a cron schedule:

https://github.com/NVIDIA/cuopt/blob/36a6a1c0edf42cec2cf07c6be3f16531f33515de/.github/workflows/nightly.yaml#L1-L6

Tests need to wait for builds to finish, and that's currently done with some shell scripts that hit the GitHub API, using a mix of `sleep` and polling.

This has sometimes resulted in nightly failures (network errors, timeouts, etc.). This PR proposes reducing the risk of such failures by moving that logic into GitHub Actions configuration directly, specifically:

* making `build.yaml` trigger `test.yaml` with the GitHub CLI **only after all package builds and publishing have finished**
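
As a rough illustration of what "trigger `test.yaml` with the GitHub CLI" could look like, here is a minimal sketch that builds a `gh workflow run` command. It is not the code from this PR: the helper names, the ref, and the input keys are hypothetical, and in the real workflow this would run as the last job, after all build and publish jobs it `needs` have finished.

```python
import subprocess


def build_trigger_command(workflow, ref, inputs):
    """Build the argv for `gh workflow run`, passing each input as a raw string field."""
    cmd = ["gh", "workflow", "run", workflow, "--ref", ref]
    for key, value in sorted(inputs.items()):
        cmd += ["--raw-field", f"{key}={value}"]
    return cmd


def trigger(workflow, ref, inputs, dry_run=True):
    """Run the command, or only return it when dry_run is True."""
    cmd = build_trigger_command(workflow, ref, inputs)
    if not dry_run:
        # Requires the `gh` CLI on PATH and a token in the environment.
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical inputs; the real workflow may expect different keys.
example = trigger(
    "test.yaml",
    ref="branch-25.10",
    inputs={"date": "2025-09-23", "sha": "0ac70df"},
)
print(" ".join(example))
```

Because the trigger only fires once the upstream jobs have succeeded, the test workflow does not need to `sleep` and poll the GitHub API to find out whether packages exist yet.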

## Issue

Contributes to #122

## Notes for Reviewers

### How I tested this

I manually triggered this run of the "Trigger Nightly cuOpt Pipeline": https://github.com/NVIDIA/cuopt/actions/runs/17935159871

Which triggered this `build` run: https://github.com/NVIDIA/cuopt/actions/runs/17935161536

Which triggered this `test` run:  https://github.com/NVIDIA/cuopt/actions/runs/17936474025

Things look ok to me!

The `test` run was not triggered until after all the relevant package builds and uploads were done, and BEFORE the docker image builds were done (as intended, to not be delayed waiting on them).

There are some test failures from artifact-downloading, like this:

```text
[rapids-github-run-id] Querying the GitHub API to determine relevant run of 'build.yaml'.
Downloading and decompressing cuopt_wheel_python_cuopt_server_cu12_py312_x86_64 from Run ID 17936253863 into /tmp/tmp.pqrBXIhMlP
```

But I think they'll be fixed by merging #409 

And the naming changes for the image builds look good 😁 

<img width="317" height="203" alt="image" src="https://github.com/user-attachments/assets/31bac7bd-1c4d-4c31-9ce9-9863778c2e89" />

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Ramakrishnap (https://github.com/rgsl888prabhu)

URL: #408