Add workflow to build RAPIDS from source with local CCCL (#1667)
* add workflow to build RAPIDS repos from source with local CCCL

* disable using cumlprims_mg in cuml and cugraph-ops in cugraph

* remove redundant mounts

* add permissions

* add permissions again

* more permissions

* copy build-in-devcontainer.yml into cccl workflows

* update workspaceFolder path and add default conda/venv name

* add rapids.Dockerfile

* remove cuda12.2-pip devcontainer

* only build RAPIDS C++ libs

* build without tests and benchmarks first, then build with them

* build separate RAPIDS libs in parallel

* move RAPIDS devcontainer into ci/rapids

* -DBUILD_SHARED_LIBS=ON

* fix typo

* fix rapids.Dockerfile location

* put -v and -j at the front

* debug init-ssh-deploy-keys call

* use the debug envvar

* fix yq filter to match cpp name

* fix filters list

* add missing -D prefix

* remove debug code

* fix args

* build wholegraph before cugraph

* explicitly reconfigure

* use miscco/cudf fork with fixes for CCCL main

* use cuco fork with fixes for CCCL main

* define CCCL and cuCollections via rapids-cmake versions.json override so CPM applies patches

* escape quotes

* replace git: with https: in CCCL git_url

* temporarily comment out the rest of the PR job

* always clone cuco, use my cuml and cugraph forks

* use my cuspatial fork

* build cugraph with less parallelism

* build with tests and benchmarks enabled

* only build for sm_70

* build with/without tests again

* add build-rapids job to nightly workflow

* uncomment the rest of the CI jobs

* build cugraph multi-gpu tests

* remove nightly schedule from build-rapids.yml

* add problem matcher to build-rapids job

* use my rapids-cmake fork with updates for CCCL 2.5

* Update launch.sh to read workspaceFolder, runArgs, initializeCommand, containerEnv, and mounts from devcontainer.json

* Add a docker-entrypoint.sh entrypoint script to change the non-root user and group

* move logic for updating manifest.yaml and cloning repos into the container and controlled by envvars, simplify build-rapids.yml and print a command to execute on failure

* don't run post-attach-command in CI

* update rapids container name

* comment out most PR jobs again

* remove --no-update-env

* fix typo

* set -x

* remove quotes

* use branch-24.06 again

* debug clone

* always generate scripts

* remove debug flags and enable full pr workflow again

* switch cudf and cuml to rapidsai branch-24.06, add full set of library overrides to build-rapids.yml

* fix ucxx branch name

* switch cugraph to rapidsai branch-24.06

* parse localEnv entries with default values

* print prettier failure message

* remove set -x

* fix problem matcher path

* remove adding problem matcher because it's added in the other workflow

* fix here-doc EOF

* clean up parsing in launch.sh, make docker-entrypoint.sh faster, move common logic from rapids-entrypoint.sh into cccl-entrypoint.sh

* more parsing cleanup and hardening

* run with --gpus all

* determine remote user from devcontainer.json or image metadata, gpus from hostRequirements.gpu in devcontainer.json

* add --gpus option to launch.sh to allow overriding devcontainer.json hostRequirements.gpu in CI

* update to RAPIDS branch-24.08

* support -e|--env in launch.sh so CI can pass additional container envvars

* support -v|--volume in launch.sh so CI can pass additional container volumes

* merge in changes from other branch

* fix docker-entrypoint.sh for Ubuntu 18.04

* Update .github/workflows/build-rapids.yml

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

* refactor JSON parsing to use python json module

* switch to rapidsai/rapids-cmake branch-24.08

* always recreate the conda env from scratch on container startup

* Clone the default rapidsai/devcontainers branch

* remove RAPIDS_TEST_OPTIONS as it's safe to just set them all regardless of which libraries are being built

* remove CCCL version from override JSON and tell rapids-cmake to always "download" CCCL from the local clone

* comment out overrides and leave a note about how to customize RAPIDS repo git details

* use exact CCCL commit hash

* temporarily disable all PR jobs except build-rapids

* 24.6 -> 24.8

* update build-rapids.yml to use launch.sh

* write aws config to local .aws dir

* change sub job name

* delete build-in-devcontainer.yml

* Revert "temporarily disable all PR jobs except build-rapids"

This reverts commit 18b2caf.

* remove set -x

* put all the envvars on one line

* check that SSH_AUTH_SOCK exists before mounting it

* unset VIRTUAL_ENV and VIRTUAL_ENV_PROMPT so that the shell init files reactivate the CCCL env for the non-root user

---------

Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Allison Piper <alliepiper16@gmail.com>
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
4 people authored Jun 14, 2024
1 parent 41beb0e commit 1b75250
Showing 10 changed files with 414 additions and 1 deletion.
1 change: 1 addition & 0 deletions .devcontainer/cuda12.2-rapids-conda
2 changes: 2 additions & 0 deletions .devcontainer/docker-entrypoint.sh
@@ -40,6 +40,8 @@ else
     #
     # We cannot use `su -w` because that's not supported by the `su` in Ubuntu18.04, so we reset the following
     # environment variables to the expected values, then pass through everything else from the startup environment.
+    export VIRTUAL_ENV=;
+    export VIRTUAL_ENV_PROMPT=;
     export HOME="$HOME_FOLDER";
     export XDG_CACHE_HOME="$HOME_FOLDER/.cache";
     export XDG_CONFIG_HOME="$HOME_FOLDER/.config";
2 changes: 1 addition & 1 deletion .devcontainer/launch.sh
@@ -217,7 +217,7 @@ launch_docker() {
     RUN_ARGS+=(--entrypoint "${WORKSPACE_FOLDER:-/home/coder/cccl}/.devcontainer/docker-entrypoint.sh")
   fi

-  if test -n "${SSH_AUTH_SOCK:-}"; then
+  if test -n "${SSH_AUTH_SOCK:-}" && test -e "${SSH_AUTH_SOCK:-}"; then
     ENV_VARS+=(--env "SSH_AUTH_SOCK=/tmp/ssh-auth-sock")
     MOUNTS+=(--mount "source=${SSH_AUTH_SOCK},target=/tmp/ssh-auth-sock,type=bind")
   fi
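
The guard added to launch.sh can be tried in isolation. The sketch below is not code from launch.sh: `ssh_agent_args` is a hypothetical helper that prints the `--env` and `--mount` arguments only when the socket path is both non-empty and present on the host, which is the condition the changed line checks. The motivation assumed here is that a stale `SSH_AUTH_SOCK` pointing at a removed socket would otherwise be passed to `docker run` as a bind-mount source that does not exist.

```bash
# Hypothetical helper (not part of launch.sh): print the docker arguments that
# forward an SSH agent socket, or print nothing if the socket is unusable.
ssh_agent_args() {
  sock="${1:-}"
  # -n: the value is non-empty; -e: the path actually exists on the host.
  if test -n "$sock" && test -e "$sock"; then
    printf '%s\n' \
      "--env SSH_AUTH_SOCK=/tmp/ssh-auth-sock" \
      "--mount source=$sock,target=/tmp/ssh-auth-sock,type=bind"
  fi
}

# Usage: pass whatever the current shell has, which may be empty or stale.
ssh_agent_args "${SSH_AUTH_SOCK:-}"
```

With an empty or dangling path the helper prints nothing, so the container starts without agent forwarding instead of failing at mount time.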
158 changes: 158 additions & 0 deletions .github/workflows/build-rapids.yml
@@ -0,0 +1,158 @@
name: Build all RAPIDS repositories

on:
  workflow_call:

jobs:
  check-event:
    name: Check GH Event
    runs-on: ubuntu-latest
    outputs:
      ok: ${{ steps.check_gh_event.outputs.ok }}
    steps:
      - id: check_gh_event
        name: Check GH Event
        shell: bash
        run: |
          [[ '${{ github.event_name }}' == 'push' && '${{ github.repository }}' == 'NVIDIA/cccl' ]] || \
          [[ '${{ github.event_name }}' == 'schedule' && '${{ github.repository }}' == 'NVIDIA/cccl' ]] || \
          [[ '${{ github.event_name }}' == 'pull_request' && '${{ github.repository }}' != 'NVIDIA/cccl' ]] \
          && echo "ok=true" | tee -a $GITHUB_OUTPUT \
          || echo "ok=false" | tee -a $GITHUB_OUTPUT;
  build-rapids:
    name: "${{ matrix.libs }}"
    if: needs.check-event.outputs.ok == 'true'
    needs: check-event
    runs-on: ${{ fromJSON(github.repository != 'NVIDIA/cccl' && '"ubuntu-latest"' || '"linux-amd64-cpu32"') }}
    strategy:
      fail-fast: false
      matrix:
        include:
          - { cuda: '12.2', libs: 'rmm KvikIO cudf cudf_kafka cuspatial', }
          - { cuda: '12.2', libs: 'rmm ucxx raft cuvs', }
          - { cuda: '12.2', libs: 'rmm ucxx raft cumlprims_mg cuml', }
          - { cuda: '12.2', libs: 'rmm ucxx raft cugraph-ops wholegraph cugraph' }
    permissions:
      id-token: write
      contents: read
    steps:
      - name: Checkout repo
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          persist-credentials: false
      - name: Add NVCC problem matcher
        run: echo "::add-matcher::$(pwd)/.github/problem-matchers/problem-matcher.json"
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::279114543810:role/gha-oidc-NVIDIA
          aws-region: us-east-2
          role-duration-seconds: 43200 # 12h
      - name: Run command # Do not change this step's name, it is checked in parse-job-times.py
        env:
          CI: true
          RAPIDS_LIBS: ${{ matrix.libs }}
          # Uncomment any of these to customize the git repo and branch for a RAPIDS lib:
          # RAPIDS_cmake_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-24.08"}'
          # RAPIDS_cudf_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-24.08"}'
          # RAPIDS_cudf_kafka_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-24.08"}'
          # RAPIDS_cugraph_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-24.08"}'
          # RAPIDS_cugraph_ops_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-24.08"}'
          # RAPIDS_cuml_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-24.08"}'
          # RAPIDS_cumlprims_mg_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-24.08"}'
          # RAPIDS_cuspatial_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-24.08"}'
          # RAPIDS_cuvs_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-24.08"}'
          # RAPIDS_KvikIO_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-24.08"}'
          # RAPIDS_raft_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-24.08"}'
          # RAPIDS_rmm_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-24.08"}'
          # RAPIDS_ucxx_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-0.39"}'
          # RAPIDS_wholegraph_GIT_REPO: '{"upstream": "rapidsai", "tag": "branch-24.08"}'
        run: |
          cat <<"EOF" > "$RUNNER_TEMP/ci-entrypoint.sh"
          #! /usr/bin/env bash
          # Start the ssh-agent and add the repo deploy keys
          if ! pgrep ssh-agent >/dev/null 2>&1; then eval "$(ssh-agent -s)"; fi
          ssh-add - <<< '${{ secrets.RAPIDSAI_CUMLPRIMS_DEPLOY_KEY }}'
          ssh-add - <<< '${{ secrets.RAPIDSAI_CUGRAPH_OPS_DEPLOY_KEY }}'
          devcontainer-utils-init-ssh-deploy-keys || true
          exec "$@"
          EOF
          cat <<"EOF" > "$RUNNER_TEMP/ci.sh"
          #! /usr/bin/env bash
          set -eo pipefail
          . ~/cccl/ci/rapids/post-create-command.sh;
          declare -a failures
          declare -A failures_map
          # Configure and build each lib with -DBUILD_TESTS=OFF, then again with -DBUILD_TESTS=ON
          for RAPIDS_ENABLE_TESTS in OFF ON; do
            _apply_manifest_modifications;
            for lib in ${RAPIDS_LIBS}; do
              sccache -z
              if ! configure-${lib}-cpp || ! build-${lib}-cpp; then
                if ! test -v failures_map["${lib}"]; then
                  failures+=("${lib}")
                  failures_map["${lib}"]=1
                fi
              fi
              sccache --show-adv-stats
            done
          done
          # Print failures and exit
          if test ${#failures[@]} -gt 0; then
            echo "::error:: Failures: ${failures[*]}"
            echo -e "::group::️❗ \e[1;31mInstructions to Reproduce CI Failure Locally\e[0m"
            echo "::error:: To replicate this failure locally, follow the steps below:"
            echo "1. Clone the repository, and navigate to the correct branch and commit:"
            echo " git clone --branch $GITHUB_REF_NAME --single-branch https://github.com/$GITHUB_REPOSITORY.git && cd $(echo $GITHUB_REPOSITORY | cut -d'/' -f2) && git checkout $GITHUB_SHA"
            echo ""
            echo "2. Run the failed command inside the same Docker container used by this CI job:"
            cat <<__EOF
          RAPIDS_LIBS='${RAPIDS_LIBS}'$(for lib in cmake ${RAPIDS_LIBS}; do var=RAPIDS_${lib//-/_}_GIT_REPO; if test -v "$var" && test -n "${!var}"; then echo -n " $var='${!var}'"; fi; done) \\
          .devcontainer/launch.sh -d -c ${{matrix.cuda}} -H rapids-conda -- ./ci/rapids/rapids-entrypoint.sh \\
          /bin/bash -li -c 'uninstall-all -j -qqq && clean-all -j && build-all -j -v || exec /bin/bash -li'
          __EOF
            echo ""
            echo "For additional information, see:"
            echo " - DevContainer Documentation: https://github.com/NVIDIA/cccl/blob/main/.devcontainer/README.md"
            echo " - Continuous Integration (CI) Overview: https://github.com/NVIDIA/cccl/blob/main/ci-overview.md"
            exit 1
          fi
          EOF
          chmod +x "$RUNNER_TEMP"/ci{,-entrypoint}.sh
          mkdir -p .aws
          cat <<EOF > .aws/config
          [default]
          bucket=rapids-sccache-devs
          region=us-east-2
          EOF
          cat <<EOF > .aws/credentials
          [default]
          aws_access_key_id=$AWS_ACCESS_KEY_ID
          aws_session_token=$AWS_SESSION_TOKEN
          aws_secret_access_key=$AWS_SECRET_ACCESS_KEY
          EOF
          chmod 0600 .aws/credentials
          chmod 0664 .aws/config
          .devcontainer/launch.sh \
            --docker \
            --cuda ${{matrix.cuda}} \
            --host rapids-conda \
            --env VAULT_HOST= \
            --env "GITHUB_SHA=$GITHUB_SHA" \
            --env "GITHUB_REF_NAME=$GITHUB_REF_NAME" \
            --env "GITHUB_REPOSITORY=$GITHUB_REPOSITORY" \
            --volume "$RUNNER_TEMP/ci.sh:/ci.sh" \
            --volume "$RUNNER_TEMP/ci-entrypoint.sh:/ci-entrypoint.sh" \
            -- /ci-entrypoint.sh ./ci/rapids/rapids-entrypoint.sh /ci.sh
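
The two-pass loop in ci.sh builds every library once with tests off and once with tests on, and reports each failing library only once. The following is a minimal sketch of that bookkeeping, not the workflow's code: `build_lib` is a hypothetical stand-in for the real `configure-<lib>-cpp` and `build-<lib>-cpp` commands (here "cudf" is made to fail in both passes), and a portable `case` membership test replaces the bash associative array that the workflow uses.

```bash
# Hypothetical stand-in for configure-<lib>-cpp / build-<lib>-cpp:
# succeeds for every library except "cudf", regardless of the tests setting.
build_lib() { [ "$1" != cudf ]; }

failures=""

# Append a library to the failure list unless it is already there.
record_failure() {
  case " $failures " in
    *" $1 "*) ;;                                # already recorded in an earlier pass
    *) failures="${failures:+$failures }$1" ;;  # first failure for this library
  esac
}

run_passes() {
  for tests in OFF ON; do      # first pass without tests, second pass with them
    for lib in "$@"; do
      build_lib "$lib" "$tests" || record_failure "$lib"
    done
  done
}

run_passes rmm cudf cuspatial
echo "Failures: ${failures:-none}"   # prints "Failures: cudf"
```

Continuing past a failed library, instead of stopping at the first error, lets a single CI run report every library that breaks against the local CCCL checkout.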
11 changes: 11 additions & 0 deletions .github/workflows/ci-workflow-nightly.yml
@@ -129,6 +129,17 @@ jobs:
         id: check-workflow
         uses: ./.github/actions/workflow-results

+  build-rapids:
+    name: Build RAPIDS
+    secrets: inherit
+    permissions:
+      actions: read
+      packages: read
+      id-token: write
+      contents: read
+      pull-requests: read
+    uses: ./.github/workflows/build-rapids.yml
+
   # Check all other job statuses. This job gates branch protection checks.
   ci:
     name: CI
11 changes: 11 additions & 0 deletions .github/workflows/ci-workflow-pull-request.yml
@@ -177,6 +177,17 @@ jobs:
           upload_workflow_artifact: "true"
           upload_pages_artifact: "false"

+  build-rapids:
+    name: Build RAPIDS
+    secrets: inherit
+    permissions:
+      actions: read
+      packages: read
+      id-token: write
+      contents: read
+      pull-requests: read
+    uses: ./.github/workflows/build-rapids.yml
+
   # Check all other job statuses. This job gates branch protection checks.
   ci:
     name: CI
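
Both the nightly and the pull-request workflow call build-rapids.yml, and its check-event job decides whether the build actually runs. The sketch below restates that gate as a stand-alone function so the outcomes are easy to check; `check_event` is a hypothetical name, and POSIX `[ ]` tests replace the `[[ ]]` expressions used in the workflow. In a shell list, `&&` and `||` have equal precedence and group left to right, so the three conditions are OR-ed together first and the result then selects one of the two `echo` commands.

```bash
# Hypothetical stand-alone version of the check-event gate.
# Prints "ok=true" or "ok=false" for a given event name and repository.
check_event() {
  event="$1"
  repo="$2"
  { [ "$event" = push ] && [ "$repo" = NVIDIA/cccl ]; } ||
  { [ "$event" = schedule ] && [ "$repo" = NVIDIA/cccl ]; } ||
  { [ "$event" = pull_request ] && [ "$repo" != NVIDIA/cccl ]; } \
    && echo "ok=true" \
    || echo "ok=false"
}

check_event push NVIDIA/cccl           # prints "ok=true"
check_event pull_request NVIDIA/cccl   # prints "ok=false"
check_event pull_request someone/cccl  # prints "ok=true"
```

Read this way, the upstream repository builds RAPIDS on `push` and `schedule` events, while forks build on `pull_request` events.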
3 changes: 3 additions & 0 deletions .gitignore
@@ -8,3 +8,6 @@ _deps/catch2-src/
 _site/
 compile_commands.json
 CMakeUserPresets.json
+/ci/rapids/.conda
+/ci/rapids/.log
+/ci/rapids/.repos
95 changes: 95 additions & 0 deletions ci/rapids/cuda12.2-conda/devcontainer.json
@@ -0,0 +1,95 @@
{
  "image": "rapidsai/devcontainers:24.08-cpp-mambaforge-ubuntu22.04",
  "runArgs": [
    "--rm",
    "--name",
    "${localEnv:USER:anon}-${localWorkspaceFolderBasename}-rapids-24.08-cuda12.2-conda"
  ],
  "hostRequirements": {"gpu": "optional"},
  "features": {
    "ghcr.io/rapidsai/devcontainers/features/rapids-build-utils:24.8": {}
  },
  "overrideFeatureInstallOrder": [
    "ghcr.io/rapidsai/devcontainers/features/rapids-build-utils"
  ],
  "containerEnv": {
    "CI": "${localEnv:CI}",
    "CUDAARCHS": "70-real",
    "CUDA_VERSION": "12.2",
    "DEFAULT_CONDA_ENV": "rapids",
    "PYTHONSAFEPATH": "1",
    "PYTHONUNBUFFERED": "1",
    "PYTHONDONTWRITEBYTECODE": "1",
    "PYTHON_PACKAGE_MANAGER": "conda",
    "SCCACHE_REGION": "us-east-2",
    "SCCACHE_BUCKET": "rapids-sccache-devs",
    "VAULT_HOST": "https://vault.ops.k8s.rapids.ai",
    "HISTFILE": "/home/coder/.cache/._bash_history",
    "LIBCUDF_KERNEL_CACHE_PATH": "/home/coder/cudf/cpp/build/latest/jitify_cache",
    "RAPIDS_LIBS": "${localEnv:RAPIDS_LIBS}",
    "RAPIDS_cmake_GIT_REPO": "${localEnv:RAPIDS_cmake_GIT_REPO}",
    "RAPIDS_rmm_GIT_REPO": "${localEnv:RAPIDS_rmm_GIT_REPO}",
    "RAPIDS_ucxx_GIT_REPO": "${localEnv:RAPIDS_ucxx_GIT_REPO}",
    "RAPIDS_kvikio_GIT_REPO": "${localEnv:RAPIDS_kvikio_GIT_REPO}",
    "RAPIDS_cudf_GIT_REPO": "${localEnv:RAPIDS_cudf_GIT_REPO}",
    "RAPIDS_raft_GIT_REPO": "${localEnv:RAPIDS_raft_GIT_REPO}",
    "RAPIDS_cuvs_GIT_REPO": "${localEnv:RAPIDS_cuvs_GIT_REPO}",
    "RAPIDS_cumlprims_mg_GIT_REPO": "${localEnv:RAPIDS_cumlprims_mg_GIT_REPO}",
    "RAPIDS_cuml_GIT_REPO": "${localEnv:RAPIDS_cuml_GIT_REPO}",
    "RAPIDS_cugraph_ops_GIT_REPO": "${localEnv:RAPIDS_cugraph_ops_GIT_REPO}",
    "RAPIDS_wholegraph_GIT_REPO": "${localEnv:RAPIDS_wholegraph_GIT_REPO}",
    "RAPIDS_cugraph_GIT_REPO": "${localEnv:RAPIDS_cugraph_GIT_REPO}",
    "RAPIDS_cuspatial_GIT_REPO": "${localEnv:RAPIDS_cuspatial_GIT_REPO}"
  },
  "initializeCommand": ["/bin/bash", "-c", "mkdir -m 0755 -p ${localWorkspaceFolder}/.{aws,cache,config} ${localWorkspaceFolder}/ci/rapids/.{conda,log/devcontainer-utils} ${localWorkspaceFolder}/ci/rapids/.repos/{rmm,kvikio,ucxx,cudf,raft,cuvs,cuml,wholegraph,cugraph,cuspatial}"],
  "postCreateCommand": ["/bin/bash", "-c", "if [ ${CI:-false} = 'false' ]; then . /home/coder/cccl/ci/rapids/post-create-command.sh; fi"],
  "postAttachCommand": ["/bin/bash", "-c", "if [ ${CODESPACES:-false} = 'true' ]; then . devcontainer-utils-post-attach-command; fi"],
  "workspaceFolder": "/home/coder/${localWorkspaceFolderBasename}",
  "workspaceMount": "source=${localWorkspaceFolder},target=/home/coder/${localWorkspaceFolderBasename},type=bind,consistency=consistent",
  "mounts": [
    "source=${localWorkspaceFolder}/.aws,target=/home/coder/.aws,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/.cache,target=/home/coder/.cache,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/.config,target=/home/coder/.config,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/ci/rapids/.repos/rmm,target=/home/coder/rmm,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/ci/rapids/.repos/kvikio,target=/home/coder/kvikio,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/ci/rapids/.repos/ucxx,target=/home/coder/ucxx,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/ci/rapids/.repos/cudf,target=/home/coder/cudf,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/ci/rapids/.repos/raft,target=/home/coder/raft,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/ci/rapids/.repos/cuvs,target=/home/coder/cuvs,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/ci/rapids/.repos/cuml,target=/home/coder/cuml,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/ci/rapids/.repos/wholegraph,target=/home/coder/wholegraph,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/ci/rapids/.repos/cugraph,target=/home/coder/cugraph,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/ci/rapids/.repos/cuspatial,target=/home/coder/cuspatial,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/ci/rapids/.conda,target=/home/coder/.conda,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/ci/rapids/.log/devcontainer-utils,target=/var/log/devcontainer-utils,type=bind,consistency=consistent"
  ],
  "customizations": {
    "vscode": {
      "extensions": [
        "augustocdias.tasks-shell-input",
        "ms-python.flake8",
        "nvidia.nsight-vscode-edition"
      ],
      "files.watcherExclude": {
        "**/build/**": true,
        "**/_skbuild/**": true,
        "**/target/**": true,
        "/home/coder/.aws/**/*": true,
        "/home/coder/.cache/**/*": true,
        "/home/coder/.conda/**/*": true,
        "/home/coder/.local/share/**/*": true,
        "/home/coder/.vscode-server/**/*": true
      },
      "search.exclude": {
        "**/build/**": true,
        "**/_skbuild/**": true,
        "**/*.code-search": true,
        "/home/coder/.aws/**/*": true,
        "/home/coder/.cache/**/*": true,
        "/home/coder/.conda/**/*": true,
        "/home/coder/.local/share/**/*": true,
        "/home/coder/.vscode-server/**/*": true
      }
    }
  }
}
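
The devcontainer.json above relies on `${localEnv:NAME}` and `${localEnv:NAME:default}` substitutions, and the commit message mentions teaching launch.sh to parse entries with default values. The sketch below illustrates the idea only and is not the launch.sh implementation: `expand_local_env` is a hypothetical helper that handles a single token, reads the variable with `printenv` (so it must be exported), and treats an empty value the same as an unset one, which is an assumption.

```bash
# Hypothetical helper: expand one "${localEnv:NAME}" or "${localEnv:NAME:default}"
# token into the host value of NAME, falling back to the default if there is one.
expand_local_env() {
  token="$1"
  prefix='${localEnv:'
  suffix='}'
  body="${token#"$prefix"}"      # drop the leading ${localEnv:
  body="${body%"$suffix"}"       # drop the trailing }
  name="${body%%:*}"             # text before the first colon
  case "$body" in
    *:*) default="${body#*:}" ;; # text after the first colon
    *)   default="" ;;
  esac
  value="$(printenv "$name" || true)"
  printf '%s\n' "${value:-$default}"
}

# Usage: the container name in runArgs starts with this token.
expand_local_env '${localEnv:USER:anon}'
```

With this reading, the `--name` entry in `runArgs` starts with the host user name when `USER` is set and with `anon` otherwise.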