Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate to NVKS for amd64 CI runners #6280

Merged
merged 4 commits into from
Feb 3, 2025

Conversation

bdice
Copy link
Contributor

@bdice bdice commented Jan 30, 2025

This migrates amd64 CI jobs (PRs and nightlies) to use L4 GPUs from the NVKS cluster.

xref: https://github.com/rapidsai/build-infra/issues/184

@bdice bdice requested a review from a team as a code owner January 30, 2025 18:28
@bdice bdice requested a review from msarahan January 30, 2025 18:28
@bdice bdice added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Jan 30, 2025
@jakirkham
Copy link
Member

Is there something still needed for CUDA 12.8 here?

Seeing the following error on CI:

Traceback (most recent call last):
  File "/opt/conda/bin/rapids-dependency-file-generator", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/rapids_dependency_file_generator/_cli.py", line 125, in main
    make_dependency_files(
  File "/opt/conda/lib/python3.12/site-packages/rapids_dependency_file_generator/_rapids_dependency_file_generator.py", line 474, in make_dependency_files
    raise ValueError(f"No matching matrix found in '{include}' for: {matrix_combo}")
ValueError: No matching matrix found in 'cuda_version' for: {'cuda': '12.8', 'arch': 'x86_64'}

Maybe we need to resolve the forward merger: #6272

@dantegd
Copy link
Member

dantegd commented Jan 31, 2025

@bdice the current failures are because the jobs are picking a build of cudf nightly that doesn't have the fixes that 25.02a358 has

@jameslamb
Copy link
Member

The most recent cuDF branch build finished a couple minutes ago (https://github.com/rapidsai/cudf/actions/runs/13078947512). I've merged brnach-25.04 in here to restart CI and pull in those new packages.

@bdice
Copy link
Contributor Author

bdice commented Feb 2, 2025

There are a couple CI tests failing like this:

=========================== short test summary info ============================
FAILED estimators_hyperparams/test_accel_kmeans.py::test_kmeans_random_state - AssertionError: assert not True
 +  where True = <function allclose at 0x7fb56e35dff0>(array([[-2.63323268,  9.04356978],\n       [-6.88387179, -6.98398415],\n       [ 4.74710337,  2.01059427]]), array([[-2.63323268,  9.04356978],\n       [-6.88387179, -6.98398415],\n       [ 4.74710337,  2.01059427]]))
 +    where <function allclose at 0x7fb56e35dff0> = np.allclose
 +    and   array([[-2.63323268,  9.04356978],\n       [-6.88387179, -6.98398415],\n       [ 4.74710337,  2.01059427]]) = wrapped <class 'sklearn.cluster._kmeans.KMeans'>.cluster_centers_
 +    and   array([[-2.63323268,  9.04356978],\n       [-6.88387179, -6.98398415],\n       [ 4.74710337,  2.01059427]]) = wrapped <class 'sklearn.cluster._kmeans.KMeans'>.cluster_centers_
====== 1 failed, 430 passed, 5 xfailed, 6 xpassed, 18 warnings in 26.17s =======

@bdice bdice requested a review from a team as a code owner February 3, 2025 14:43
@bdice bdice requested review from betatim and divyegala February 3, 2025 14:43
@github-actions github-actions bot added the Cython / Python Cython or Python issue label Feb 3, 2025
Comment on lines -103 to -105
kmeans3 = KMeans(n_clusters=3, random_state=24).fit(X)
# With different random_state, results might differ
assert not np.allclose(kmeans1.cluster_centers_, kmeans3.cluster_centers_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to remove this part of the test. It is a bit unclear what it is trying to check because the cluster centers might or might not be the same if we use a different seed. This means that all we can assert here is maybe_allclose, which doesn't seem like a useful assertion.

So 👍 to removing it

Copy link
Member

@betatim betatim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes look good to me.

Workflow changes look reasonable, branch names are consistently changed, but hopefully someone who knows has already looked at this :D

@jameslamb jameslamb removed the request for review from msarahan February 3, 2025 15:21
@bdice
Copy link
Contributor Author

bdice commented Feb 3, 2025

/merge

@bdice
Copy link
Contributor Author

bdice commented Feb 3, 2025

CI failed in an optional job:

 =========================== short test summary info ============================
FAILED test_device_selection.py::test_kmeans_methods[cpu-cpu] - assert 0.865575773037488 >= 0.9
 +  where 0.865575773037488 = adjusted_rand_score(array([ 4, 19,  1, 14, 18,  9,  9, 13,  1,  7, 14,  9,  2, 18,  2, 18, 16,\n       15, 11, 18, 17,  8,  3, 15, 10, 13, ...2, 16,  1, 11,  1, 11,  6,  1,  8, 13,  7, 11,\n        2, 10,  2,  1,  6,  9, 18,  0,  5,  5,  6,  0, 15], dtype=int32), array([ 3, 10,  1,  6,  9,  7,  7, 15,  1,  5,  6,  7, 11,  9, 11,  9, 14,\n       15, 18,  9, 11,  4, 19, 15,  8, 15, ...6, 14,  1,  0,  1,  0, 13,  1,  4, 15,  5, 18,\n       11,  8, 11,  1, 13,  7,  9, 12, 17, 17, 13, 12, 15], dtype=int32))
= 1 failed, 3882 passed, 192 skipped, 80 xfailed, 37 xpassed, 330 warnings in 2809.43s (0:46:49) =

@rapids-bot rapids-bot bot merged commit 57aa1b7 into rapidsai:branch-25.04 Feb 3, 2025
69 of 70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants