nvidia-driver: detection of nvidia devices #1364

Closed
heilerich opened this issue Feb 16, 2024 · 5 comments
Labels
area/gpu channel/alpha Issue concerns the Alpha channel. channel/beta Issue concerns the Beta channel. channel/stable Issue concerns the Stable channel. kind/bug Something isn't working

Comments

@heilerich

Description

The nvidia.service uses lspci | grep -i "${NVIDIA_PRODUCT_TYPE}" to detect the presence of Nvidia devices. Not all Nvidia GPUs include their product type in the name that lspci outputs.

Changing the product type to something else that matches the devices breaks the download link for the driver files.

Impact

The Nvidia driver is not installed on machines that need it.

Expected behavior

Maybe just use lspci | grep -i NVIDIA. This is already done in other places in the install script.
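
To illustrate the mismatch, here is a minimal sketch that applies both filters to the lspci line quoted further down in this thread. The sample line is hard-coded and the helper name device_matches is hypothetical; the actual install script greps live lspci output.

```bash
# Sample lspci line taken from this thread; a real script would read live `lspci` output.
sample='3D controller: NVIDIA Corporation AD104GL [L4] (rev a1)'

# Hypothetical helper: succeeds if the text on stdin contains the filter, ignoring case.
device_matches() {
  grep -qi -- "$1"
}

# Filtering by product type misses the card because "tesla" is not in its name.
if printf '%s\n' "$sample" | device_matches "tesla"; then
  echo "tesla: detected"
else
  echo "tesla: not detected"
fi

# Filtering by vendor name finds it.
if printf '%s\n' "$sample" | device_matches "NVIDIA"; then
  echo "NVIDIA: detected"
else
  echo "NVIDIA: not detected"
fi
```

With the sample line above this prints "tesla: not detected" followed by "NVIDIA: detected".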

Additional information

Flatcar 3815.2.0

@jepio
Member

jepio commented Feb 22, 2024

Makes sense.

Could you share examples of NVIDIA_PRODUCT_TYPE where this is needed?

@jepio jepio added area/gpu channel/alpha Issue concerns the Alpha channel. channel/beta Issue concerns the Beta channel. channel/stable Issue concerns the Stable channel. labels Feb 22, 2024
@heilerich
Author

I think we might need two variables, one for the 'driver type' and one for identifying the devices. Currently, the NVIDIA_PRODUCT_TYPE variable is used for both.

For example:

  • We have a setup that uses RTX Datacenter GPUs such as RTX 6000, Quadro and the like. There we would have to grep the lspci output for something like 'NVIDIA' or 'RTX'. Those devices use the tesla driver.
  • There are many other cards for which this is true. The only system I have access to right now from this computer in fact uses a Tesla L40 but still shows up as 3D controller: NVIDIA Corporation AD104GL [L4] (rev a1) in lspci.
  • Some of our Datacenter GPUs show up with the string 'tesla' in lspci and can use the tesla driver. In this case the current script works. But if one had to use the open driver, e.g. for some legal reason, one would still have to grep for 'tesla' but use 'XFree86' in the driver URL.

So maybe two variables NVIDIA_DEVICE_FILTER and NVIDIA_DRIVER_TYPE are more sensible?
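
A rough sketch of what that split could look like, assuming the two variable names proposed above (neither exists in the current script) and two hypothetical helper functions. It only decides which driver type would be used; building the download URL is left out.

```bash
# Hypothetical variables proposed in this thread; neither exists in the current script.
NVIDIA_DEVICE_FILTER="${NVIDIA_DEVICE_FILTER:-NVIDIA}"   # matched against lspci output
NVIDIA_DRIVER_TYPE="${NVIDIA_DRIVER_TYPE:-tesla}"        # used only to pick the driver

# Succeeds if the lspci text on stdin mentions a device matching the filter.
has_matching_device() {
  grep -qi -- "$NVIDIA_DEVICE_FILTER"
}

# Prints the driver type that would be installed, or nothing if no device matched.
select_driver_type() {
  if has_matching_device; then
    printf '%s\n' "$NVIDIA_DRIVER_TYPE"
  fi
}

# Example with a captured line instead of live `lspci` output:
printf '%s\n' '3D controller: NVIDIA Corporation AD104GL [L4] (rev a1)' | select_driver_type
```

The third case in the list would then be expressed as NVIDIA_DEVICE_FILTER=tesla together with NVIDIA_DRIVER_TYPE=XFree86, so changing the filter no longer changes which driver is fetched.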

@jepio
Member

jepio commented Feb 23, 2024

Can you check whether this proposed function catches all of the cases that you listed above?

function is_nvidia_probe_required() {
  # Vendor: NVIDIA, Class: VGA compatible controller
  # (the selector is quoted so the shell never treats "*" as a filename glob)
  if [[ -n "$(lspci -d '10de:*:0300')" ]]; then
    return 0
  fi
  # Vendor: NVIDIA, Class: 3D controller
  if [[ -n "$(lspci -d '10de:*:0302')" ]]; then
    return 0
  fi
  return 1
}
is_nvidia_probe_required && echo "correct"

That should be generic enough and leave the user to only have to worry about the driver type to deploy.
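
For checking the same vendor/class rule against saved output from machines that are not at hand, a sketch along these lines may help. It assumes the numeric format printed by lspci -n (slot, class, vendor:device); the function name has_nvidia_gpu is hypothetical and the device IDs in the usage line are illustrative only.

```bash
# Succeeds if the `lspci -n` text on stdin lists an NVIDIA (vendor 10de) device
# whose class is VGA compatible controller (0300) or 3D controller (0302).
# Expected line shape: "01:00.0 0302: 10de:20b0 (rev a1)"
has_nvidia_gpu() {
  awk '
    {
      class = $2
      sub(/:$/, "", class)        # "0302:" -> "0302"
      split($3, id, ":")          # "10de:20b0" -> id[1] = vendor, id[2] = device
      if (tolower(id[1]) == "10de" && (class == "0300" || class == "0302"))
        found = 1
    }
    END { exit(found ? 0 : 1) }
  '
}
```

On a live system it could be used as lspci -n | has_nvidia_gpu && echo "probe required". Other NVIDIA functions on the same card, such as the HDMI audio device (class 0403), do not count as a match.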

@heilerich
Author

I have tested on systems containing RTX 2000, RTX 4000, RTX 6000 series consumer grade and datacenter GPUs as well as with Tesla A100 and L40 GPUs. Worked everywhere.

@jepio
Member

jepio commented Feb 27, 2024

I'll close this since it's going to be part of the next set of releases.
