OCPNODE-4043: Add DRA e2e tests to run on NVIDIA GPU#30758

Open
sairameshv wants to merge 3 commits into openshift:main from sairameshv:nvidia_dra_ocp

sairameshv (Member) commented Feb 4, 2026

Add NVIDIA DRA E2E tests for OpenShift
Implements comprehensive E2E tests for NVIDIA Dynamic Resource Allocation (DRA) on OpenShift clusters with GPU nodes.

  • Skip the tests for non-GPU clusters
  • Automated prerequisite installation (GPU Operator + DRA Driver)
  • Single GPU allocation tests
  • Multi-GPU workload tests (skipped on single-GPU setups)
  • Resource lifecycle validation
  • README.md explaining how to run the e2e tests and install the prerequisites

Tested on: OCP 4.21.0, Kubernetes 1.34.2, Tesla T4 GPU
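
As a rough illustration of the skip behaviour listed above (not the PR's actual Go test code), the decision can be sketched as a small shell helper. The function name `dra_skip_reason` and its two arguments are hypothetical, and how the GPU count is discovered on a real cluster is deliberately left out.

```shell
# Hypothetical helper: decide whether a DRA test should be skipped.
#   $1 = number of GPUs detected in the cluster (detection is out of scope here)
#   $2 = number of GPUs the test needs (1 for single-GPU tests, 2 or more for multi-GPU tests)
# Prints a skip reason and returns 0 when the test should be skipped;
# prints nothing and returns 1 when the test can run.
dra_skip_reason() {
    available=$1
    required=$2
    if [ "$available" -eq 0 ]; then
        echo "no GPU nodes found in the cluster"
        return 0
    fi
    if [ "$available" -lt "$required" ]; then
        echo "test needs $required GPUs, cluster has $available"
        return 0
    fi
    return 1
}
```

With this shape, a single-GPU cluster runs the single-GPU tests (`dra_skip_reason 1 1` returns 1) and skips the multi-GPU ones (`dra_skip_reason 1 2` prints a reason).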

@openshift-ci-robot
Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 4, 2026
openshift-ci-robot commented Feb 4, 2026

@sairameshv: This pull request references OCPNODE-4043 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.


@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 4, 2026
openshift-ci bot (Contributor) commented Feb 4, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot (Contributor) commented Feb 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sairameshv
Once this PR has been reviewed and has the lgtm label, please assign bertinatto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot commented Feb 4, 2026

@sairameshv: This pull request references OCPNODE-4043 which is a valid jira issue.


@sairameshv sairameshv marked this pull request as ready for review February 11, 2026 06:06
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 11, 2026
@openshift-ci-robot
Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

sairameshv (Member, Author)

/retest

- README.md describing in detail how to run these tests and install the
  prerequisites, to help with manual validation

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
Implements comprehensive E2E tests for NVIDIA Dynamic Resource Allocation (DRA) on OpenShift clusters with GPU nodes.

- Skip the test for non-GPU clusters
- Automated prerequisite installation (GPU Operator + DRA Driver)
- Single GPU allocation tests
- Multi-GPU workload tests (skipped on single-GPU setups)
- Resource lifecycle validation

Tested on: OCP 4.21.0, Kubernetes 1.34.2, Tesla T4 GPU

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>

# Install GPU Operator with OpenShift-specific settings
# This is exactly what prerequisites_installer.go does
helm install gpu-operator nvidia/gpu-operator \

Member

Should we install the GPU Operator using the Operator Framework instead of Helm? I'm pretty sure this is what we'll suggest our customers do.

Contributor

Agree to this.

Member Author

I think we can follow either installation approach, i.e. via OLM or via Helm. Alternatively, we can handle this step in the openshift/release repo, as a prerequisite that runs after installing a cluster with GPUs and before running these test cases, so that we don't have to deal with any changes to the GPU Operator installation process.

WDYT?

Contributor

I like the step-registry approach, since it can be reused and referenced in other release jobs.

Contributor

For lws we do use OLM to install it.

Same for JobSet.

This works well, but it would be nice to consult with the gpu-operator team so they can own this step-registry step.

Member Author

We did have this step in the initial InstaSlice e2e versions; later, I guess, the installation part was moved to the operator code. Yeah, it didn't have any issues.

Okay, I would remove the GPU Operator installation part from here and try to incorporate it as part of the openshift/release step-registry.

I would like to keep the DRA driver installation here so that

  • We would be able to test if the driver can be installed before running the tests
  • With the GPU operator as a pre-req, openshift-tests run-test .. can directly test the DRA features
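
For readers unfamiliar with the OLM route discussed in this thread: installing an operator through OLM comes down to creating a Subscription (plus a Namespace and an OperatorGroup, omitted here). The sketch below only prints such a manifest. The package name, catalog source and namespace are assumptions based on how the certified NVIDIA GPU Operator is commonly published on OpenShift; the channel is passed in as an argument because it changes per release; the function name is hypothetical. None of this is taken from the PR.

```shell
# Hypothetical helper: print an OLM Subscription manifest for the GPU Operator.
#   $1 = subscription channel (release-specific, so it is not hard-coded here)
# ASSUMPTIONS: package "gpu-operator-certified", catalog "certified-operators",
# namespace "nvidia-gpu-operator". Verify them against the NVIDIA docs before use.
gpu_operator_subscription() {
    channel=$1
    cat <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: "${channel}"
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
}
```

A step-registry step could pipe the output into `oc apply -f -` once the Namespace and OperatorGroup exist, which keeps the GPU Operator installation outside the test binary as proposed above.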


# Install NVIDIA DRA driver via Helm
# ⚠️ CRITICAL: nvidiaDriverRoot MUST be /run/nvidia/driver (NOT /)
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \

Member

Is this automatically the latest version then?

Member Author

Yes, IIRC, with no "--version" flag, this installs the latest version
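
To make that concrete: `helm install` accepts a `--version` flag, and without it Helm picks the latest stable chart version. If reproducible CI runs matter, the chart version could be pinned. The sketch below only builds and prints the command line rather than running Helm. `DRA_CHART_VERSION` and `dra_helm_cmd` are hypothetical names, and the remaining flags of the snippet under review are truncated in the diff, so they are not reproduced here.

```shell
# Hypothetical helper: build the helm command for the DRA driver,
# pinning the chart version only when DRA_CHART_VERSION is set.
# Release and chart names are taken from the snippet under review.
dra_helm_cmd() {
    cmd="helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu"
    if [ -n "${DRA_CHART_VERSION:-}" ]; then
        # --version is a real helm install flag; the value is supplied by the caller.
        cmd="$cmd --version $DRA_CHART_VERSION"
    fi
    echo "$cmd"
}
```

Leaving the variable unset keeps today's behaviour (latest version); setting it in the CI job would make a run repeatable against a known chart.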

- sairameshv
- harche
- haircommander
- mrunalp

Contributor

  • rphillips

- sairameshv
- harche
- haircommander
- mrunalp

Contributor

  • rphillips

- Drop automatic installation of the NVIDIA GPU Operator from the
  prerequisite installation so that it can be installed via OLM instead
- Point to the official NVIDIA docs for installing the GPU Operator on OpenShift

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>

openshift-ci bot (Contributor) commented Feb 17, 2026

@sairameshv: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-vsphere-ovn-upi | 2b7c3c3 | link | true | /test e2e-vsphere-ovn-upi |
| ci/prow/e2e-aws-ovn-serial-1of2 | 2b7c3c3 | link | true | /test e2e-aws-ovn-serial-1of2 |

