Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CA-370578 use subsystemId in NVidia GPU matching #4826

Merged
merged 1 commit into from
Oct 19, 2022

Conversation

lindig
Copy link
Contributor

@lindig lindig commented Oct 18, 2022

So far, we identified a GPU by its deviceId; this turns out to be ambiguous and vgpuConfig.xml may contain multiple GPUs with identical deviceId, causing us to use more vGPU types than we can actually support.

This patch introduces a new abstract type "gpu" in the VENDOR signature such that its Nvidia instance can use more information to identify a GPU. This is the core of this patch and fix.

When matching a GPU found on the PCI bus with GPUs listed in vgpuConfig.xml we use:

  deviceId match and subsystemId match

or deviceId match and subsystemId in vgpuConfig is 0

subsystemId=0 is used by Nvidia as a wildcard and thus subsystemId=0 is ignored during a match.

The matching for other vendors remains unchanged (and is still based on deviceId).

Signed-off-by: Christian Lindig christian.lindig@citrix.com

@lindig
Copy link
Contributor Author

lindig commented Oct 18, 2022

This applies cleanly to the Yangtze branch for a potential backport.

Copy link
Contributor

@edwintorok edwintorok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't change the IDL or anything in the DB, so should be ok to backport.

ocaml/xapi/xapi_vgpu_type.ml Show resolved Hide resolved
@lindig
Copy link
Contributor Author

lindig commented Oct 18, 2022

Added comments as suggested. Will squash once accepted.

So far, we identified a GPU by its deviceId; this turns out to be
ambiguous and vgpuConfig.xml may contain multiple GPUs with identical
deviceId, causing us to use more vGPU types than we can actually
support.

This patch introduces a new abstract type "gpu" in the VENDOR signature
such that its Nvidia instance can use more information to identify a
GPU.

When matching a GPU found on the PCI bus with GPUs listed in
vgpuConfig.xml we use:

      deviceId match and subsystemId match
  or  deviceId match and subsystemId in vgpuConfig is 0

subsystemId=0 is used by Nvidia as a wildcard and thus subsystemId=0 is
ignored during a match.

The matching for other vendors remains unchanged (and is still based on
deviceId).

Signed-off-by: Christian Lindig <christian.lindig@citrix.com>
@lindig lindig force-pushed the private/christianlin/CA-370578 branch from 48a75b9 to eb855fb Compare October 18, 2022 14:57
@@ -410,6 +413,16 @@ module Vendor_nvidia = struct
; compatible_model_names_on_pgpu: string list
}

type gpu = {device_id: int; subsys_id: int}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we don't need the subsystem vendor id at this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do need it for Nvidia - that is the entire point of the PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The is a subsystem device ID, and a subsystem vendor ID as well, which is not in the type now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. We don't use the subsystem vendor ID right now but could easily extend this to use it as well. We would expect it to match exactly (no wildcard mechanism).

@lindig lindig merged commit b6f17fd into xapi-project:master Oct 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants