OCPNODE-4043: Add DRA e2e tests to run on NVIDIA GPU #30758
sairameshv wants to merge 3 commits into openshift:main
@sairameshv: This pull request references OCPNODE-4043, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: sairameshv
Force-pushed from 12d663b to 1808b00.
/retest
- Add a README.md with a detailed description of how to run these tests and install the prerequisites, to help with manual validation
Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
Implements comprehensive E2E tests for NVIDIA Dynamic Resource Allocation (DRA) on OpenShift clusters with GPU nodes.
- Skip the tests on non-GPU clusters
- Automated prerequisite installation (GPU Operator + DRA Driver)
- Single GPU allocation tests
- Multi-GPU workload tests (skipped on a single-GPU setup)
- Resource lifecycle validation
Tested on: OCP 4.21.0, Kubernetes 1.34.2, Tesla T4 GPU
Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
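The skip rules in this commit message can be sketched as a small selection function. This is an illustration in shell only, not the PR's implementation: the function name select_dra_tests, the group labels it prints, and the way the GPU count is obtained are all hypothetical.

```shell
# Hypothetical helper: decide which DRA test groups to run for a given
# number of allocatable GPUs in the cluster.
#   0 GPUs  -> skip everything (non-GPU cluster)
#   1 GPU   -> single-GPU tests only (multi-GPU tests are skipped)
#   2+ GPUs -> single-GPU and multi-GPU tests
select_dra_tests() {
  gpus="${1:-0}"
  if [ "$gpus" -le 0 ]; then
    echo "skip"
  elif [ "$gpus" -eq 1 ]; then
    echo "single-gpu"
  else
    echo "single-gpu multi-gpu"
  fi
}
```

Calling select_dra_tests with the GPU count discovered on the cluster prints the groups that would run, so a single Tesla T4 setup like the one in "Tested on" would select only the single-GPU group.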
Force-pushed from 1808b00 to c7c003a.
# Install GPU Operator with OpenShift-specific settings
# This is exactly what prerequisites_installer.go does
helm install gpu-operator nvidia/gpu-operator \
Should we install the GPU Operator using the Operator Framework instead of Helm? I'm pretty sure this is what we'll suggest our customers do.
I think we can follow either installation approach, i.e. via OLM or via Helm. Or we can take up this step as part of the openshift/release repo, as a prerequisite after installing a cluster with GPU and before running these test cases, so that we don't have to deal with any change in the GPU Operator installation process.
WDYT?
I like the approach of step-registry so this can be reused and referenced in other release jobs.
For lws we do use OLM to install it.
Same for JobSet.
This works well but it would be nice to consult with gpu-operator team so they can own this step registry.
We did have this step in the initial instaslice e2e versions; later, I guess, the installation part was moved to the operator code. Yeah, it didn't have any issues.
Okay, I will remove the GPU Operator installation part from here and try to incorporate it into the openshift/release step-registry.
I would like to keep the DRA driver installation here so that:
- We can verify that the driver can be installed before running the tests
- With the GPU Operator as a prerequisite, openshift-tests run-test .. can directly test the DRA features
# Install NVIDIA DRA driver via Helm
# ⚠️ CRITICAL: nvidiaDriverRoot MUST be /run/nvidia/driver (NOT /)
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
is this automatically the latest version then?
Yes, IIRC, with no "--version" flag, this installs the latest version
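For context, Helm's --version flag pins the chart version; without it (and without --devel), Helm resolves to the latest stable chart version in the repo. A minimal sketch of how the install command could accept an optional pin follows. The function name build_dra_install_cmd and the variable DRA_DRIVER_VERSION are hypothetical, no particular version is implied, and the remaining flags of the command quoted in this thread (truncated in the snippet) are deliberately left out.

```shell
# Hypothetical helper: print the helm install command for the DRA driver.
# With an argument, the chart is pinned via --version; with no argument,
# the command is left unpinned and Helm picks the latest stable chart.
# Other flags from the original (truncated) command are omitted here.
build_dra_install_cmd() {
  version="${1:-}"
  cmd="helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu"
  if [ -n "$version" ]; then
    cmd="$cmd --version $version"
  fi
  printf '%s\n' "$cmd"
}

# Example: print the command for review before running it.
# DRA_DRIVER_VERSION is an assumed, caller-provided variable.
build_dra_install_cmd "${DRA_DRIVER_VERSION:-}"
```

Pinning would make CI runs reproducible, while leaving the variable unset keeps the behavior described in the reply above.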
- sairameshv
- harche
- haircommander
- mrunalp
- sairameshv
- harche
- haircommander
- mrunalp
- Defer automatic installation of the NVIDIA GPU Operator from prerequisite installation so that it can be installed via OLM
- Point to the official NVIDIA docs for installing the GPU Operator on OpenShift
Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
Force-pushed from 5031be3 to 2b7c3c3.
@sairameshv: The following tests failed.
Add NVIDIA DRA E2E tests for OpenShift
Implements comprehensive E2E tests for NVIDIA Dynamic Resource Allocation (DRA) on OpenShift clusters with GPU nodes.
Tested on: OCP 4.21.0, Kubernetes 1.34.2, Tesla T4 GPU