Skip to content
This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

chore: Upgrade nvidia drivers to 450.80.02 #4376

Merged
merged 2 commits into from
Apr 27, 2021
Merged

chore: Upgrade nvidia drivers to 450.80.02 #4376

merged 2 commits into from
Apr 27, 2021

Conversation

gdippolito
Copy link
Contributor

Reason for Change:

Upgrade the Nvidia drivers for GPU machine to 450.80.02.

This version of the drivers is the same used in AzureML based images since April 2021. This change will add compatibility with newer version of CUDA (11.1,11.2,11.3). Reference https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

Issue Fixed:

None. It adds support to run docker images based on CUDA >= 11.1

Credit Where Due:

Does this change contain code from or inspired by another project?

  • No
  • Yes

If "Yes," did you notify that project's maintainers and provide attribution?

  • No
  • Yes

Requirements:

Notes:

Should be pretty safe since it is a minor version upgrade.

@welcome
Copy link

welcome bot commented Apr 21, 2021

💖 Thanks for opening your first pull request! 💖 We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix. Examples of commit messages with semantic prefixes: - fix: change azure disk cachingMode to ReadOnly - feat: make maximumLoadBalancerRuleCount configurable - docs: add note on AKS Engine and AKS relationship
Make sure to check out the developer guide for guidance on testing your change.

@ghost
Copy link

ghost commented Apr 21, 2021

CLA assistant check
All CLA requirements met.

@jackfrancis
Copy link
Member

Thanks @gdippolito, is there a newer version of basic CUDA validation we can run in E2E that you know of? We're still running the v0.1 image from 4+ years ago, see:

https://github.com/Azure/aks-engine/blob/master/test/e2e/kubernetes/workloads/cuda-vector-add.yaml

@gdippolito
Copy link
Contributor Author

Thanks @gdippolito, is there a newer version of basic CUDA validation we can run in E2E that you know of? We're still running the v0.1 image from 4+ years ago, see:

https://github.com/Azure/aks-engine/blob/master/test/e2e/kubernetes/workloads/cuda-vector-add.yaml

Thanks for your comment @jackfrancis. I have updated the e2e test with a more recent image which uses CUDA 11.1.

I have done few tests locally and this sample image does not seem to work with the version of the drivers I picked (it works on 460.x but not on 450.80.02). I'm a bit confused since according to the Nvidia documentation this is expected to work.

@jackfrancis
Copy link
Member

@gdippolito thanks! I've tested your new image, and the E2E tests pass fine. So I think we're good here.

cc @Michael-Sinz @jadarsie are there any concerns with the following:

  • updating the driver version
  • testing nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.1-ubuntu18.04 instead of k8s.gcr.io/cuda-vector-add:v0.1

To be clear, these driver changes are specific to docker-backed clusters. If you're using containerd you can't use these drivers, and are probably using the gpu-operator already if you've figured out how to make this work with aks-engine-created clusters.

@Michael-Sinz
Copy link
Collaborator

Michael-Sinz commented Apr 22, 2021 via email

@jadarsie
Copy link
Member

@gdippolito thanks! I've tested your new image, and the E2E tests pass fine. So I think we're good here.

cc @Michael-Sinz @jadarsie are there any concerns with the following:

* updating the driver version

* testing `nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.1-ubuntu18.04` instead of `k8s.gcr.io/cuda-vector-add:v0.1`

To be clear, these driver changes are specific to docker-backed clusters. If you're using containerd you can't use these drivers, and are probably using the gpu-operator already if you've figured out how to make this work with aks-engine-created clusters.

FYI @penggu

Copy link
Member

@jackfrancis jackfrancis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@jackfrancis
Copy link
Member

/azp run pr-e2e

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@acs-bot
Copy link

acs-bot commented Apr 23, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gdippolito, jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov
Copy link

codecov bot commented Apr 23, 2021

Codecov Report

Merging #4376 (109b346) into master (60828c7) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4376   +/-   ##
=======================================
  Coverage   72.04%   72.04%           
=======================================
  Files         141      141           
  Lines       21631    21631           
=======================================
  Hits        15584    15584           
  Misses       5096     5096           
  Partials      951      951           
Impacted Files Coverage Δ
pkg/api/common/versions.go 96.37% <ø> (ø)
pkg/engine/templates_generated.go 43.31% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e6f43bc...109b346. Read the comment docs.

@penggu
Copy link
Contributor

penggu commented Apr 26, 2021

Looks good to me.

@gdippolito
Copy link
Contributor Author

Shall we 🚢 it?

@jackfrancis jackfrancis merged commit 86897b5 into Azure:master Apr 27, 2021
@welcome
Copy link

welcome bot commented Apr 27, 2021

Congrats on merging your first pull request! 🎉🎉🎉

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants