unified define k8s-driver-manager image info in values.yaml #1032

Closed
lengrongfu wants to merge 293 commits from the feat/unified-version branch

Conversation

lengrongfu
Contributor

Fixes: #642

@tariq1890
Contributor

Hi @lengrongfu , thanks for your contribution! Can you rebase this PR?

copy-pr-bot bot commented Nov 23, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

elezar and others added 20 commits December 2, 2024 22:18
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
…e to Role

Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
Bumps [golangci/golangci-lint-action](https://github.com/golangci/golangci-lint-action) from 5 to 6.
- [Release notes](https://github.com/golangci/golangci-lint-action/releases)
- [Commits](golangci/golangci-lint-action@v5...v6)

---
updated-dependencies:
- dependency-name: golangci/golangci-lint-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps nvidia/cuda from 12.4.1-base-ubi8 to 12.5.0-base-ubi8.

---
updated-dependencies:
- dependency-name: nvidia/cuda
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps nvidia/cuda from 12.4.1-base-ubi8 to 12.5.0-base-ubi8.

---
updated-dependencies:
- dependency-name: nvidia/cuda
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
…ripts

This commit updates the driver validation to always create a 'driver-ready' file,
regardless of whether the driver is installed on the host. It also populates this file
with a list of environment variables, one per line, which are required by some operands.

The startup scripts for several operands are simplified: they now source the contents
of this file before executing the main program for the container.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
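
A minimal Python sketch of the 'driver-ready' idea described in the commit above: one process writes environment variables to a file, one per line, and a consumer reads them back before starting its main program. The KEY=VALUE line format, the function names, and the variable names in the demo are assumptions made for illustration; the commit message does not state them, and the real operands source the file from shell startup scripts.

```python
import os
import tempfile


def write_driver_ready(path, env):
    """Write one KEY=VALUE pair per line (assumed format, not confirmed by the commit)."""
    with open(path, "w") as f:
        for key, value in env.items():
            f.write(f"{key}={value}\n")


def load_driver_ready(path):
    """Read the file back into a dict, skipping blank or malformed lines."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or "=" not in line:
                continue
            # Split on the first '=' only, so values may contain '='.
            key, _, value = line.partition("=")
            env[key] = value
    return env


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ready_file = os.path.join(tmp, "driver-ready")
        # EXAMPLE_DRIVER_ROOT is a hypothetical variable name.
        write_driver_ready(ready_file, {"EXAMPLE_DRIVER_ROOT": "/run/nvidia/driver"})
        print(load_driver_ready(ready_file))
```

Because the file is always created, a consumer can read it unconditionally and needs no separate branch for host-installed drivers.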
…naged by GPU Operator

This commit updates our driver validator to check for the presence of .driver-ctr.ready,
a file created by our driver daemonset readiness probe, only if the driver container
is managed by GPU Operator. This allows us to support non-standard environments, like
GKE, where a driver container is deployed but not managed by GPU Operator.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
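
The conditional check described in the commit above could be sketched roughly as follows. The function name, the `status_dir` parameter, and the `managed_by_operator` flag are hypothetical; only the file name `.driver-ctr.ready` comes from the commit message.

```python
import os
import tempfile

READY_FILE = ".driver-ctr.ready"  # file name taken from the commit message


def driver_container_ready(status_dir, managed_by_operator):
    """Return True when validation may proceed past the readiness check."""
    if not managed_by_operator:
        # Assumption: an externally managed driver container (e.g. on GKE)
        # has no readiness probe creating the file, so the check is skipped.
        return True
    return os.path.isfile(os.path.join(status_dir, READY_FILE))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as status_dir:
        print(driver_container_ready(status_dir, managed_by_operator=True))
        print(driver_container_ready(status_dir, managed_by_operator=False))
```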
This commit updates our driver validator code to only chroot when
validating a host installed driver. When validating a driver container
install, we discover the paths to libnvidia-ml.so.1 and nvidia-smi at
the driver container root and then run 'LD_PRELOAD=/driverRoot/path/to/libnvidia-ml.so.1 nvidia-smi'.

This sets the stage for validating driver container installs where driverRoot
does not represent a full filesystem hierarchy that one can 'chroot' into.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
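
As an illustration of the two validation paths in the commit above, the sketch below builds the command line for each case without running it. The function names are hypothetical, the directory walk is only one possible way to discover the files, and the exact form of the host-driver `chroot` invocation is an assumption; the commit states only that `chroot` is used for host installs and that `LD_PRELOAD` points at the discovered libnvidia-ml.so.1 for driver container installs.

```python
import os
import tempfile


def find_first(root, filename):
    """Return the first path under root whose basename is filename, or None."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None


def validation_command(driver_root, host_driver):
    """Return (extra_env, argv) describing how nvidia-smi would be invoked."""
    if host_driver:
        # Host install: driver_root is a full filesystem, so chroot is possible.
        return {}, ["chroot", driver_root, "nvidia-smi"]
    # Driver container install: locate the library and binary under driver_root.
    lib = find_first(driver_root, "libnvidia-ml.so.1")
    smi = find_first(driver_root, "nvidia-smi")
    if lib is None or smi is None:
        raise FileNotFoundError("driver files not found under " + driver_root)
    return {"LD_PRELOAD": lib}, [smi]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Fake driver root with empty placeholder files.
        for rel in ("usr/lib64/libnvidia-ml.so.1", "usr/bin/nvidia-smi"):
            path = os.path.join(root, rel)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, "w").close()
        print(validation_command(root, host_driver=False))
```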
RootFS represents the path to the root filesystem of the host.
This is used by components that need to interact with the host filesystem
and as such this must be a chroot-able filesystem.
Examples include the MIG Manager and Toolkit Container which may need to
stop, start, or restart systemd services.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
Co-authored-by: Angelos Kolaitis <neoaggelos@gmail.com>
…and containers except the driver-validator

Having a static path inside our containers will make it easier when driverRoot is a configurable field.
If driverRoot is set to a custom path, we can transform the host path for the volume while keeping
the container path unchanged.

The driver-validation initContainer is the exception to this rule. In the driver validation
initContainer, the container path must match the host path; otherwise, the /dev/char symlinks will not
resolve correctly on the host. The target of each symlink must correspond to the path of the device
node on the host. For example, when the NVIDIA device nodes are present under `/run/nvidia/driver/dev`
on the host, running the following command from inside the container would create an invalid symlink:

  ln -s /driver-root/dev/nvidiactl /host-dev-char/195:255

while running the below command from inside the container would create a valid symlink:

  ln -s /run/nvidia/driver/dev/nvidiactl /host-dev-char/195:255

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
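
The symlink rule from the commit above can be checked with a small sketch. It only computes path strings and creates nothing on disk. Both function names are hypothetical, and `/host-dev-char` is used as the default link directory only because it appears in the example commands.

```python
import posixpath


def dev_char_symlink(dev_root, device, major, minor, dev_char_dir="/host-dev-char"):
    """Return (target, link) for a /dev/char entry pointing at a device node."""
    target = posixpath.join(dev_root, "dev", device)
    link = posixpath.join(dev_char_dir, f"{major}:{minor}")
    return target, link


def resolves_on_host(target, host_dev_dir):
    """True when the symlink target lies in the directory holding the host's device nodes."""
    return posixpath.dirname(target) == host_dev_dir.rstrip("/")


if __name__ == "__main__":
    host_dev_dir = "/run/nvidia/driver/dev"
    for root in ("/driver-root", "/run/nvidia/driver"):
        target, link = dev_char_symlink(root, "nvidiactl", 195, 255)
        print(link, "->", target, "valid on host:", resolves_on_host(target, host_dev_dir))
```

The symlink is created through a mounted view of the host's /dev/char, but its target string is resolved later on the host, which is why a container-only path such as `/driver-root` cannot work as the target.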
This allows for non-standard driver container installations, where
the driver installation path and device nodes are rooted at paths
other than '/run/nvidia/driver'.

Note, setting driverInstallDir to a custom value is currently
only supported for driver container installations not managed
by GPU Operator. One example is the GKE use case, where a driver
daemonset is deployed prior to installing GPU Operator and the GPU
Operator managed driver is disabled.

The GPU Operator's driver container daemonset still assumes that
the full driver installation is made available at '/run/nvidia/driver'
on the host, and consequently, we always mount '/run/nvidia/driver'
into the GPU Operator managed daemonset. We may consider removing this
assumption in the future and support driver container implementations
which allow for a custom driverInstallDir to be specified.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
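
To make the effect of a custom driverInstallDir concrete, here is a hedged sketch of how a host path could be swapped while the container path stays fixed, with the driver-validator as the stated exception. The function name, the volume name, and the use of `/driver-root` as the static container path are assumptions for illustration; the dictionaries only mimic the shape of Kubernetes `hostPath` volume and `volumeMount` entries and are not the operator's actual manifests.

```python
DEFAULT_DRIVER_INSTALL_DIR = "/run/nvidia/driver"  # default named in the commit messages
CONTAINER_DRIVER_ROOT = "/driver-root"  # assumed static in-container path


def driver_volume(driver_install_dir=None, is_validator=False):
    """Build a hostPath volume and its mount for a given driverInstallDir."""
    host_path = driver_install_dir or DEFAULT_DRIVER_INSTALL_DIR
    # The validator must see the same path as the host so that the
    # /dev/char symlinks it creates resolve on the host.
    mount_path = host_path if is_validator else CONTAINER_DRIVER_ROOT
    name = "driver-install-dir"  # hypothetical volume name
    return {
        "volume": {"name": name, "hostPath": {"path": host_path}},
        "volumeMount": {"name": name, "mountPath": mount_path},
    }


if __name__ == "__main__":
    # /custom/driver/root is a made-up example path.
    print(driver_volume("/custom/driver/root"))
    print(driver_volume("/custom/driver/root", is_validator=True))
```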
tariq1890 and others added 28 commits December 2, 2024 22:19
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
…toring

Bumps [github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring](https://github.com/prometheus-operator/prometheus-operator) from 0.76.2 to 0.78.1.
- [Release notes](https://github.com/prometheus-operator/prometheus-operator/releases)
- [Changelog](https://github.com/prometheus-operator/prometheus-operator/blob/main/CHANGELOG.md)
- [Commits](prometheus-operator/prometheus-operator@v0.76.2...v0.78.1)

---
updated-dependencies:
- dependency-name: github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Bumps [github.com/NVIDIA/nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) from 1.17.0 to 1.17.2.
- [Release notes](https://github.com/NVIDIA/nvidia-container-toolkit/releases)
- [Changelog](https://github.com/NVIDIA/nvidia-container-toolkit/blob/v1.17.2/CHANGELOG.md)
- [Commits](NVIDIA/nvidia-container-toolkit@v1.17.0...v1.17.2)

---
updated-dependencies:
- dependency-name: github.com/NVIDIA/nvidia-container-toolkit
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Bumps [github.com/onsi/gomega](https://github.com/onsi/gomega) from 1.35.0 to 1.35.1.
- [Release notes](https://github.com/onsi/gomega/releases)
- [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md)
- [Commits](onsi/gomega@v1.35.0...v1.35.1)

---
updated-dependencies:
- dependency-name: github.com/onsi/gomega
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.20.4 to 1.20.5.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](prometheus/client_golang@v1.20.4...v1.20.5)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Bumps [golang.org/x/mod](https://github.com/golang/mod) from 0.21.0 to 0.22.0.
- [Commits](golang/mod@v0.21.0...v0.22.0)

---
updated-dependencies:
- dependency-name: golang.org/x/mod
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/regclient/regclient](https://github.com/regclient/regclient) from 0.7.1 to 0.7.2.
- [Release notes](https://github.com/regclient/regclient/releases)
- [Changelog](https://github.com/regclient/regclient/blob/v0.7.2/release.md)
- [Commits](regclient/regclient@v0.7.1...v0.7.2)

---
updated-dependencies:
- dependency-name: github.com/regclient/regclient
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
…B_ENV

Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Bumps [sigs.k8s.io/controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) from 0.19.0 to 0.19.1.
- [Release notes](https://github.com/kubernetes-sigs/controller-runtime/releases)
- [Changelog](https://github.com/kubernetes-sigs/controller-runtime/blob/main/RELEASE.md)
- [Commits](kubernetes-sigs/controller-runtime@v0.19.0...v0.19.1)

---
updated-dependencies:
- dependency-name: sigs.k8s.io/controller-runtime
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [sigs.k8s.io/controller-tools](https://github.com/kubernetes-sigs/controller-tools) from 0.16.4 to 0.16.5.
- [Release notes](https://github.com/kubernetes-sigs/controller-tools/releases)
- [Changelog](https://github.com/kubernetes-sigs/controller-tools/blob/main/envtest-releases.yaml)
- [Commits](kubernetes-sigs/controller-tools@v0.16.4...v0.16.5)

---
updated-dependencies:
- dependency-name: sigs.k8s.io/controller-tools
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.21.0 to 2.22.0.
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](onsi/ginkgo@v2.21.0...v2.22.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
@lengrongfu lengrongfu force-pushed the feat/unified-version branch from 6890ac9 to b93f0b3 Compare December 2, 2024 14:19
@lengrongfu lengrongfu closed this by deleting the head repository Dec 2, 2024
Development

Successfully merging this pull request may close these issues.

k8s-driver-manager use a unified version