[Bug] Cannot provision instances with the A10 GPU on the azure backend #1014

Closed
peterschmidt85 opened this issue Mar 14, 2024 · 3 comments · Fixed by #1150

@peterschmidt85
Contributor

Steps to reproduce:

  1. Run dstack run . -b azure --gpu A10 --spot-auto
  • The run hangs for about 20 minutes and then fails
  • The server throws an error
  • The instance remains in the Running state

@r4victor
Collaborator

I believe this is the reason:

The Azure NVads A10 v5 VMs only support GRID 14.1(510.73) or higher driver versions.

Azure requires NVIDIA GRID drivers for NV-series, but dstack only installs CUDA drivers.

https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup
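
A minimal sketch of how a provisioning or image-build step might pick the driver flavor from the Azure VM size. The function name `required_driver` and the returned labels are hypothetical and not part of dstack. The size-name pattern (for example `Standard_NV36ads_A10_v5`) is an assumption about how NVads A10 v5 sizes are named, and only that family is routed to GRID here.

```python
import re

# Assumption: NVads A10 v5 sizes look like Standard_NV<n>ads_A10_v5
# (or Standard_NV<n>adms_A10_v5 for the high-memory variant).
_NVADS_A10_V5 = re.compile(r"^Standard_NV\d+adm?s_A10_v5$", re.IGNORECASE)


def required_driver(vm_size: str) -> str:
    """Return "grid" for NVads A10 v5 sizes, "cuda" for everything else.

    Hypothetical helper: the labels are illustrative, not a dstack API.
    """
    return "grid" if _NVADS_A10_V5.match(vm_size) else "cuda"


if __name__ == "__main__":
    for size in ("Standard_NV36ads_A10_v5", "Standard_NC24ads_A100_v4"):
        print(size, "->", required_driver(size))
```

Keeping the rule as a single pattern makes it easy to extend if other VM families turn out to need GRID drivers as well.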

@peterschmidt85
Contributor Author

peterschmidt85 commented Mar 20, 2024

Steps to reproduce the problem

  1. Install the normal CUDA driver (as we do in our packer script)
  2. Reboot the instance
    It hangs forever on reboot
  • Reproduced

Potential fix

  1. Instead of the normal CUDA driver, install the GRID driver (e.g. by following Azure's guide)
  • Reproduced

Implementation notes

Azure's guide uses version 535.154.05, which seems to be compatible with the version we use (535.54.03-1). I couldn't find where to download the GRID driver for exactly 535.54.03-1. Since the two versions seem to be compatible, we could in theory follow the instructions in Azure's guide.
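
A rough sketch of the version reasoning above, assuming that "same driver branch" (the first number, 535) is a usable proxy for compatibility. That is an assumption, not something NVIDIA or Azure guarantee. The 510.73 minimum comes from the Azure note quoted earlier in this thread; all function names are hypothetical.

```python
from typing import Tuple


def parse_driver_version(version: str) -> Tuple[int, ...]:
    """Turn "535.54.03-1" into (535, 54, 3), dropping any "-N" package suffix."""
    core = version.split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


# Minimum driver version quoted for NVads A10 v5 (GRID 14.1).
MIN_GRID_VERSION = parse_driver_version("510.73")


def meets_grid_minimum(version: str) -> bool:
    """True if the driver version is at least the quoted GRID minimum."""
    return parse_driver_version(version) >= MIN_GRID_VERSION


def same_branch(a: str, b: str) -> bool:
    """Assumption: drivers from the same major branch are treated as compatible."""
    return parse_driver_version(a)[0] == parse_driver_version(b)[0]


if __name__ == "__main__":
    azure_guide, ours = "535.154.05", "535.54.03-1"
    print(meets_grid_minimum(azure_guide), same_branch(azure_guide, ours))
```

This only checks the version numbers; whether the GRID build actually works with the rest of the image still has to be verified on a real instance.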

It seems that for the A10 on Azure, we need to build a separate VM image.
