gpu: copy nfdhook functionality to gpu-plugin #1492

Merged: 3 commits, Sep 12, 2023
70 changes: 3 additions & 67 deletions cmd/gpu_nfdhook/README.md
@@ -11,6 +11,8 @@ Table of Contents

## Introduction

***NOTE:*** NFD's binary hook support will be turned off by default in the 0.14 release. The functionality of the GPU NFD hook has been moved into a new NFD rule and into the GPU plugin, and the capability labels are being removed completely. The GPU plugin deployment no longer supports using an init container. This directory will be removed in the future.

This is the [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery)
binary hook implementation for the Intel GPUs. The intel-gpu-initcontainer (which
is built with the other images) can be used as part of the gpu-plugin deployment
@@ -23,70 +25,4 @@ by the NFD, allowing for finer grained resource management for GPU-using PODs.

In the NFD deployment, the hook requires the `/host-sys` folder to have the host `/sys` folder content mounted. Write access is not necessary.

## GPU memory

The GPU memory amount is read from sysfs `gt/gt*` files and turned into a label.
Two environment variables are supported, `GPU_MEMORY_OVERRIDE` and
`GPU_MEMORY_RESERVED`, both of which should hold numeric byte amounts. On systems with
older kernel drivers or GPUs that do not support reading the GPU memory
amount, the `GPU_MEMORY_OVERRIDE` value is used for the GPU memory
amount label instead of a read value. The `GPU_MEMORY_RESERVED` value is
subtracted from the GPU memory amount found in sysfs.

## Default labels

The following labels are created by default. You may turn numeric labels into extended resources with NFD.

| name | type | description |
|------|------|-------------|
| `gpu.intel.com/millicores` | number | node GPU count * 1000. Can be used as a finer grained shared execution fraction. |
| `gpu.intel.com/memory.max` | number | sum of detected [GPU memory amounts](#gpu-memory) in bytes OR environment variable value * GPU count |
| `gpu.intel.com/cards` | string | list of card names separated by '`.`'. The names match host `card*`-folders under `/sys/class/drm/`. Deprecated, use `gpu-numbers`. |
| `gpu.intel.com/gpu-numbers` | string | list of numbers separated by '`.`'. The numbers correspond to device file numbers for the primary nodes of given GPUs in kernel DRI subsystem, listed as `/dev/dri/card<num>` in devfs, and `/sys/class/drm/card<num>` in sysfs. |
| `gpu.intel.com/tiles` | number | sum of all detected GPU tiles in the system. |
| `gpu.intel.com/numa-gpu-map` | string | list of numa node to gpu mappings. |

If the value of the `gpu-numbers` label does not fit into the 63-character label value limit, additional labels `gpu-numbers2`,
`gpu-numbers3` and so on are created until all the GPU numbers have been labeled.

The tile count `gpu.intel.com/tiles` describes the total number of tiles on the system. The system is expected to be homogeneous, and thus the number of tiles per GPU can be calculated by dividing the tile count by the GPU count.

The `numa-gpu-map` label is a list of NUMA-node-to-GPU mapping items separated by `_`. Each item combines a NUMA node id with a list of GPU indices. For example, `0-1.2.3` means that NUMA node 0 has GPUs 1, 2 and 3. A more complex example is `0-0.1_1-3.4`, where NUMA node 0 has GPUs 0 and 1, and NUMA node 1 has GPUs 3 and 4. As with `gpu-numbers`, this label is split into multiple labels if the length of the value exceeds the maximum label length.

## PCI-groups (optional)

GPUs which share the same PCI paths under `/sys/devices/pci*` can be grouped into a label. GPU numbers are separated by '`.`' and
groups are separated by '`_`'. The label is created only if the environment variable `GPU_PCI_GROUPING_LEVEL` has a value greater
than zero. GPUs are considered to belong to the same group if they share as many identical folder names as the value
of the environment variable, counting from the folder name that starts with `pci`.

For example, the SG1 card has 4 GPUs, which end up sharing pci-folder names under `/sys/devices`. With a `GPU_PCI_GROUPING_LEVEL`
of 3, a node with two such SG1 cards could produce a `pci-groups` label with a value of `0.1.2.3_4.5.6.7`.

| name | type | description |
|------|------|-------------|
| `gpu.intel.com/pci-groups` | string | list of pci-groups separated by '`_`'. GPU numbers in the groups are separated by '`.`'. The numbers correspond to device file numbers for the primary nodes of given GPUs in kernel DRI subsystem, listed as `/dev/dri/card<num>` in devfs, and `/sys/class/drm/card<num>` in sysfs. |

If the value of the `pci-groups` label does not fit into the 63-character label value limit, additional labels `pci-groups2`,
`pci-groups3` and so on are created until all the PCI groups have been labeled.

## Capability labels (optional)

Capability labels are created from information found inside debugfs, and therefore
unfortunately require running the NFD worker as root. Because debugfs is not
guaranteed to be stable, these labels are not guaranteed to be stable either.
If you do not need them, simply do not run the NFD worker as root; that is also more secure.
Depending on your kernel driver, running the NFD hook as root may introduce the following labels:

| name | type | description |
|------|------|-------------|
| `gpu.intel.com/platform_gen` | string | GPU platform generation name, typically an integer. Deprecated. |
| `gpu.intel.com/media_version` | string | GPU platform Media pipeline generation name, typically a number. Deprecated. |
| `gpu.intel.com/graphics_version` | string | GPU platform graphics/compute pipeline generation name, typically a number. Deprecated. |
| `gpu.intel.com/platform_<PLATFORM_NAME>.count` | number | GPU count for the named platform. |
| `gpu.intel.com/platform_<PLATFORM_NAME>.tiles` | number | GPU tile count in the GPUs of the named platform. |
| `gpu.intel.com/platform_<PLATFORM_NAME>.present` | string | "true" for indicating the presence of the GPU platform. |

## Limitations

For the above to work as intended, GPUs on the same node must be identical in their capabilities.
For detailed info about the labels created by the NFD hook, see the [labels documentation](../gpu_plugin/labels.md).
19 changes: 4 additions & 15 deletions cmd/gpu_nfdhook/main.go
@@ -15,25 +15,14 @@
package main

import (
"os"

"k8s.io/klog/v2"
"github.com/intel/intel-device-plugins-for-kubernetes/cmd/internal/labeler"
)

const (
sysfsDirectory = "/host-sys"
sysfsDRMDirectory = sysfsDirectory + "/class/drm"
debugfsDRIDirectory = sysfsDirectory + "/kernel/debug/dri"
sysfsDirectory = "/host-sys"
sysfsDRMDirectory = sysfsDirectory + "/class/drm"
)

func main() {
l := newLabeler(sysfsDRMDirectory, debugfsDRIDirectory)

err := l.createLabels()
if err != nil {
klog.Errorf("%+v", err)
os.Exit(1)
}

l.printLabels()
labeler.CreateAndPrintLabels(sysfsDRMDirectory)
}
4 changes: 4 additions & 0 deletions cmd/gpu_plugin/README.md
@@ -23,6 +23,7 @@ Table of Contents
* [Fractional resources details](#fractional-resources-details)
* [Verify Plugin Registration](#verify-plugin-registration)
* [Testing and Demos](#testing-and-demos)
* [Labels created by GPU plugin](#labels-created-by-gpu-plugin)
* [Issues with media workloads on multi-GPU setups](#issues-with-media-workloads-on-multi-gpu-setups)
* [Workaround for QSV and VA-API](#workaround-for-qsv-and-va-api)

@@ -339,6 +340,9 @@ The GPU plugin functionality can be verified by deploying an [OpenCL image](../.
Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 Insufficient gpu.intel.com/i915.
```

## Labels created by GPU plugin

If installed with NFD and started with resource management enabled, the plugin will export a set of labels for the node. For detailed info, see the [labeling documentation](./labels.md).
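
As a rough illustration of what that export looks like, the sketch below writes labels as `name=value` lines into the NFD feature file location used by this PR (`/etc/kubernetes/node-feature-discovery/features.d/intel-gpu-resources.txt`). The label values and the exact feature file format are assumptions for the example, not the plugin's actual output.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// writeFeatureFile writes labels as name=value lines, one per line, which is
// the assumed format of an NFD local feature file.
func writeFeatureFile(dir, filename string, labels map[string]string) error {
	lines := make([]string, 0, len(labels))
	for name, value := range labels {
		lines = append(lines, fmt.Sprintf("%s=%s", name, value))
	}

	return os.WriteFile(filepath.Join(dir, filename), []byte(strings.Join(lines, "\n")+"\n"), 0o644)
}

func main() {
	// Example values only; the real labels come from the plugin's labeler.
	labels := map[string]string{
		"gpu.intel.com/millicores": "1000",
		"gpu.intel.com/memory.max": "8589934592",
	}

	err := writeFeatureFile("/etc/kubernetes/node-feature-discovery/features.d",
		"intel-gpu-resources.txt", labels)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```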

## Issues with media workloads on multi-GPU setups

35 changes: 32 additions & 3 deletions cmd/gpu_plugin/gpu_plugin.go
@@ -1,4 +1,4 @@
// Copyright 2017-2022 Intel Corporation. All Rights Reserved.
// Copyright 2017-2023 Intel Corporation. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
@@ -31,13 +31,16 @@ import (
pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"

"github.com/intel/intel-device-plugins-for-kubernetes/cmd/gpu_plugin/rm"
"github.com/intel/intel-device-plugins-for-kubernetes/cmd/internal/labeler"
"github.com/intel/intel-device-plugins-for-kubernetes/cmd/internal/pluginutils"
dpapi "github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin"
)

const (
sysfsDrmDirectory = "/sys/class/drm"
devfsDriDirectory = "/dev/dri"
nfdFeatureDir = "/etc/kubernetes/node-feature-discovery/features.d"
resourceFilename = "intel-gpu-resources.txt"
gpuDeviceRE = `^card[0-9]+$`
controlDeviceRE = `^controlD[0-9]+$`
pciAddressRE = "^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\\.[0-9a-f]{1}$"
@@ -53,6 +56,9 @@ const (

// Period of device scans.
scanPeriod = 5 * time.Second

// Labeler's max update interval, 5min.
labelerMaxInterval = 5 * 60 * time.Second
)

type cliOptions struct {
@@ -242,8 +248,9 @@ type devicePlugin struct {
controlDeviceReg *regexp.Regexp
pciAddressReg *regexp.Regexp

scanTicker *time.Ticker
scanDone chan bool
scanTicker *time.Ticker
scanDone chan bool
scanResources chan bool

resMan rm.ResourceManager

@@ -270,6 +277,7 @@ func newDevicePlugin(sysfsDir, devfsDir string, options cliOptions) *devicePlugin
scanTicker: time.NewTicker(scanPeriod),
scanDone: make(chan bool, 1), // buffered as we may send to it before Scan starts receiving from it
bypathFound: true,
scanResources: make(chan bool, 1),
}

if options.resourceManagement {
@@ -347,17 +355,26 @@ func (dp *devicePlugin) Scan(notifier dpapi.Notifier) error {
klog.Warning("Failed to scan: ", err)
}

countChanged := false

for name, prev := range previousCount {
count := devTree.DeviceTypeCount(name)
if count != prev {
klog.V(1).Infof("GPU scan update: %d->%d '%s' resources found", prev, count, name)

previousCount[name] = count

countChanged = true
}
}

notifier.Notify(devTree)

// Trigger resource scan if it's enabled.
if dp.resMan != nil && countChanged {
dp.scanResources <- true
}

select {
case <-dp.scanDone:
return nil
@@ -515,6 +532,18 @@ func main() {
klog.V(1).Infof("GPU device plugin started with %s preferred allocation policy", opts.preferredAllocationPolicy)

plugin := newDevicePlugin(prefix+sysfsDrmDirectory, prefix+devfsDriDirectory, opts)

if plugin.options.resourceManagement {
// Start labeler to export labels file for NFD.
nfdFeatureFile := path.Join(nfdFeatureDir, resourceFilename)

klog.V(2).Infof("NFD feature file location: %s", nfdFeatureFile)

// Labeler catches OS signals and calls os.Exit() after receiving any.
go labeler.Run(prefix+sysfsDrmDirectory, nfdFeatureFile,
labelerMaxInterval, plugin.scanResources)
}

manager := dpapi.NewManager(namespace, plugin)
manager.Run()
}
2 changes: 1 addition & 1 deletion cmd/gpu_plugin/gpu_plugin_test.go
@@ -1,4 +1,4 @@
// Copyright 2017-2021 Intel Corporation. All Rights Reserved.
// Copyright 2017-2023 Intel Corporation. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
87 changes: 87 additions & 0 deletions cmd/gpu_plugin/labels.md
@@ -0,0 +1,87 @@
# Labels

GPU labels originate from two main sources: NFD rules and the GPU plugin (and the NFD hook).

## NFD rules

An NFD rule is a way to instruct NFD to add certain labels to a node based on the devices detected on it. There is a generic rule that identifies all Intel GPUs and adds labels for each PCI device type. For example, a Tigerlake iGPU (PCI ID 0x9a49) will show up as:

```
gpu.intel.com/device-id.0300-9a49.count=1
gpu.intel.com/device-id.0300-9a49.present=true
```

For data center GPUs, there are more specific rules which will create additional labels for GPU family, product and device count. For example, Flex 170:
```
gpu.intel.com/device.count=1
gpu.intel.com/family=Flex_Series
gpu.intel.com/product=Flex_170
```

For Max 1550:
```
gpu.intel.com/device.count=2
gpu.intel.com/family=Max_Series
gpu.intel.com/product=Max_1550
```

Currently covered platforms/devices are: Flex 140, Flex 170, Max 1100 and Max 1550.

To identify other GPUs, see the graphics processor table [here](https://dgpu-docs.intel.com/devices/hardware-table.html#graphics-processor-table).
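
As a minimal sketch of consuming such labels, the helper below parses the `gpu.intel.com/device-id.<class>-<id>.*` label names shown above back into their PCI class and device ID parts. The function is hypothetical and not part of this repository.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDeviceIDLabel splits a label name such as
// "gpu.intel.com/device-id.0300-9a49.count" into its PCI class ("0300")
// and device ID ("9a49") parts.
func parseDeviceIDLabel(name string) (class, device string, ok bool) {
	const prefix = "gpu.intel.com/device-id."

	if !strings.HasPrefix(name, prefix) {
		return "", "", false
	}

	rest := strings.TrimPrefix(name, prefix) // e.g. "0300-9a49.count"
	rest = strings.TrimSuffix(rest, ".count")
	rest = strings.TrimSuffix(rest, ".present")

	parts := strings.SplitN(rest, "-", 2)
	if len(parts) != 2 {
		return "", "", false
	}

	return parts[0], parts[1], true
}

func main() {
	class, device, ok := parseDeviceIDLabel("gpu.intel.com/device-id.0300-9a49.count")
	fmt.Println(class, device, ok) // 0300 9a49 true
}
```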

## GPU Plugin and NFD hook

In the GPU plugin, these labels are applied only when [Resource Management](README.md#fractional-resources-details) is enabled. With the NFD hook, labels are created regardless of how the GPU plugin is configured.

Numeric labels are converted into extended resources for the node (with NFD), and other labels are used directly by [GPU Aware Scheduling (GAS)](https://github.com/intel/platform-aware-scheduling/tree/master/gpu-aware-scheduling). Extended resources should only be used with GAS, as the Kubernetes scheduler doesn't properly handle resource allocations with multiple GPUs.
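
For illustration, a workload could request such extended resources roughly as in the sketch below, built with the Kubernetes Go client types. The resource names assume NFD has been configured to expose the numeric labels as extended resources, and the amounts and image name are arbitrary.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Request one i915 device and a tenth of one GPU's millicores.
	// Whether these extended resources exist depends on the NFD setup.
	container := corev1.Container{
		Name:  "gpu-workload",
		Image: "my-gpu-workload:latest", // placeholder image
		Resources: corev1.ResourceRequirements{
			Limits: corev1.ResourceList{
				"gpu.intel.com/i915":       resource.MustParse("1"),
				"gpu.intel.com/millicores": resource.MustParse("100"),
			},
		},
	}

	fmt.Printf("%+v\n", container.Resources.Limits)
}
```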

### Default labels

The following labels are created by default.

| name | type | description |
|------|------|-------------|
| `gpu.intel.com/millicores` | number | node GPU count * 1000. |
| `gpu.intel.com/memory.max` | number | sum of detected [GPU memory amounts](#gpu-memory) in bytes OR environment variable value * GPU count |
| `gpu.intel.com/cards` | string | list of card names separated by '`.`'. The names match host `card*`-folders under `/sys/class/drm/`. Deprecated, use `gpu-numbers`. |
| `gpu.intel.com/gpu-numbers` | string | list of numbers separated by '`.`'. The numbers correspond to device file numbers for the primary nodes of given GPUs in kernel DRI subsystem, listed as `/dev/dri/card<num>` in devfs, and `/sys/class/drm/card<num>` in sysfs. |
| `gpu.intel.com/tiles` | number | sum of all detected GPU tiles in the system. |
| `gpu.intel.com/numa-gpu-map` | string | list of numa node to gpu mappings. |

If the value of the `gpu-numbers` label does not fit into the 63-character label value limit, additional labels `gpu-numbers2`,
`gpu-numbers3` and so on are created until all the GPU numbers have been labeled.
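
A minimal sketch of reading such a split label back, assuming the continuation labels simply concatenate onto the first value (how the '`.`' separator falls at the split boundary is an assumption here):

```go
package main

import (
	"fmt"
	"strings"
)

// joinSplitLabel concatenates a base label with its numbered continuation
// labels (base2, base3, ...) and splits the result on '.'.
func joinSplitLabel(labels map[string]string, base string) []string {
	value := labels[base]

	for i := 2; ; i++ {
		part, ok := labels[fmt.Sprintf("%s%d", base, i)]
		if !ok {
			break
		}

		value += part
	}

	if value == "" {
		return nil
	}

	return strings.Split(value, ".")
}

func main() {
	labels := map[string]string{
		"gpu.intel.com/gpu-numbers":  "0.1.2",
		"gpu.intel.com/gpu-numbers2": ".3.4", // continuation, values made up
	}

	fmt.Println(joinSplitLabel(labels, "gpu.intel.com/gpu-numbers")) // [0 1 2 3 4]
}
```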

The tile count `gpu.intel.com/tiles` describes the total number of tiles on the system. The system is expected to be homogeneous, and thus the number of tiles per GPU can be calculated by dividing the tile count by the GPU count.

The `numa-gpu-map` label is a list of NUMA-node-to-GPU mapping items separated by `_`. Each item combines a NUMA node id with a list of GPU indices. For example, `0-1.2.3` means that NUMA node 0 has GPUs 1, 2 and 3. A more complex example is `0-0.1_1-3.4`, where NUMA node 0 has GPUs 0 and 1, and NUMA node 1 has GPUs 3 and 4. As with `gpu-numbers`, this label is split into multiple labels if the length of the value exceeds the maximum label length.
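
Below is a short sketch of parsing that format into a NUMA-node-to-GPU map; the helper is illustrative and not part of this repository.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNumaGpuMap parses a value such as "0-0.1_1-3.4" into a map from
// NUMA node id to GPU indices.
func parseNumaGpuMap(value string) (map[int][]int, error) {
	result := map[int][]int{}

	for _, item := range strings.Split(value, "_") {
		nodeAndGpus := strings.SplitN(item, "-", 2)
		if len(nodeAndGpus) != 2 {
			return nil, fmt.Errorf("malformed item %q", item)
		}

		node, err := strconv.Atoi(nodeAndGpus[0])
		if err != nil {
			return nil, err
		}

		for _, gpu := range strings.Split(nodeAndGpus[1], ".") {
			index, err := strconv.Atoi(gpu)
			if err != nil {
				return nil, err
			}

			result[node] = append(result[node], index)
		}
	}

	return result, nil
}

func main() {
	m, _ := parseNumaGpuMap("0-0.1_1-3.4")
	fmt.Println(m) // map[0:[0 1] 1:[3 4]]
}
```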

### PCI-groups (optional)

GPUs which share the same PCI paths under `/sys/devices/pci*` can be grouped into a label. GPU numbers are separated by '`.`' and
groups are separated by '`_`'. The label is created only if the environment variable `GPU_PCI_GROUPING_LEVEL` has a value greater
than zero. GPUs are considered to belong to the same group if they share as many identical folder names as the value
of the environment variable, counting from the folder name that starts with `pci`.

For example, the SG1 card has 4 GPUs, which end up sharing pci-folder names under `/sys/devices`. With a `GPU_PCI_GROUPING_LEVEL`
of 3, a node with two such SG1 cards could produce a `pci-groups` label with a value of `0.1.2.3_4.5.6.7`.

| name | type | description |
|------|------|-------------|
| `gpu.intel.com/pci-groups` | string | list of pci-groups separated by '`_`'. GPU numbers in the groups are separated by '`.`'. The numbers correspond to device file numbers for the primary nodes of given GPUs in kernel DRI subsystem, listed as `/dev/dri/card<num>` in devfs, and `/sys/class/drm/card<num>` in sysfs. |

If the value of the `pci-groups` label does not fit into the 63-character label value limit, additional labels `pci-groups2`,
`pci-groups3` and so on are created until all the PCI groups have been labeled.
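
The sketch below illustrates the grouping rule described above: GPUs whose sysfs device paths share the first `GPU_PCI_GROUPING_LEVEL` folder names, counted from the folder starting with `pci`, land in the same group. The paths and helper are illustrative, not the hook's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// pciGroupKey returns the first <level> folder names of a sysfs device path,
// starting from the folder whose name begins with "pci". GPUs with the same
// key belong to the same pci-group.
func pciGroupKey(sysfsDevicePath string, level int) string {
	folders := strings.Split(strings.Trim(sysfsDevicePath, "/"), "/")

	for i, folder := range folders {
		if strings.HasPrefix(folder, "pci") && i+level <= len(folders) {
			return strings.Join(folders[i:i+level], "/")
		}
	}

	return ""
}

func main() {
	// Made-up sysfs paths for two GPUs behind the same PCI bridge.
	paths := map[string]string{
		"0": "/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/drm/card0",
		"1": "/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:02.0/drm/card1",
	}

	groups := map[string][]string{}
	for gpu, path := range paths {
		key := pciGroupKey(path, 3) // GPU_PCI_GROUPING_LEVEL=3
		groups[key] = append(groups[key], gpu)
	}

	fmt.Println(groups) // both GPUs end up in the same group
}
```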

### Limitations

For the above to work as intended, GPUs on the same node must be identical in their capabilities.

### GPU memory

The GPU memory amount is read from sysfs `gt/gt*` files and turned into a label.
Two environment variables are supported, `GPU_MEMORY_OVERRIDE` and
`GPU_MEMORY_RESERVED`, both of which should hold numeric byte amounts. On systems with
older kernel drivers or GPUs that do not support reading the GPU memory
amount, the `GPU_MEMORY_OVERRIDE` value is used for the GPU memory
amount label instead of a read value. The `GPU_MEMORY_RESERVED` value is
subtracted from the GPU memory amount found in sysfs.
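
A sketch of the documented behaviour (not the labeler's actual code): fall back to `GPU_MEMORY_OVERRIDE` when the amount cannot be read from sysfs, and subtract `GPU_MEMORY_RESERVED` from it otherwise.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// memoryLabelValue mimics the documented rule: use GPU_MEMORY_OVERRIDE when
// sysfs reading is not possible, otherwise subtract GPU_MEMORY_RESERVED from
// the amount read from sysfs.
func memoryLabelValue(sysfsBytes uint64, sysfsReadOK bool) uint64 {
	if !sysfsReadOK {
		override, err := strconv.ParseUint(os.Getenv("GPU_MEMORY_OVERRIDE"), 10, 64)
		if err != nil {
			return 0
		}

		return override
	}

	reserved, err := strconv.ParseUint(os.Getenv("GPU_MEMORY_RESERVED"), 10, 64)
	if err != nil {
		return sysfsBytes
	}

	if reserved > sysfsBytes {
		return 0
	}

	return sysfsBytes - reserved
}

func main() {
	os.Setenv("GPU_MEMORY_RESERVED", "1073741824") // reserve 1 GiB

	fmt.Println(memoryLabelValue(8*1024*1024*1024, true)) // 7516192768
}
```
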
2 changes: 1 addition & 1 deletion cmd/gpu_plugin/render-device.sh
@@ -1,6 +1,6 @@
#!/bin/sh
#
# Copyright 2021 Intel Corporation.
# Copyright 2021-2023 Intel Corporation.
#
# SPDX-License-Identifier: Apache-2.0
#
4 changes: 2 additions & 2 deletions cmd/gpu_plugin/rm/gpu_plugin_resource_manager.go
@@ -1,4 +1,4 @@
// Copyright 2021 Intel Corporation. All Rights Reserved.
// Copyright 2021-2023 Intel Corporation. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
@@ -337,7 +337,7 @@ func (rm *resourceManager) listPods() (*v1.PodList, error) {
for i := 0; i < kubeletAPIMaxRetries; i++ {
if podList, err := rm.listPodsFromKubelet(); err == nil {
return podList, nil
} else if errors.As(err, neterr); neterr.Timeout() {
} else if errors.As(err, &neterr) && neterr.Timeout() {
continue
}

2 changes: 1 addition & 1 deletion cmd/gpu_plugin/rm/gpu_plugin_resource_manager_test.go
@@ -1,4 +1,4 @@
// Copyright 2021 Intel Corporation. All Rights Reserved.
// Copyright 2021-2023 Intel Corporation. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.