Fast GPU Provisioning Technology
Fast GPU Provisioning technology enables GPU provisioning in less than one second with no reboots, using pre-built driver containers. The feature removes the dependency on machine configuration changes, which trigger a node reboot, an expensive operation. Instead, the required operations are performed at runtime, which simplifies and accelerates the deployment process.
To achieve this in the 1.2.1 release, two Kernel Module Management (KMM) features are leveraged. The first is setting the firmware search path on the fly, which is required to load the out-of-tree firmware binaries on RHCOS: KMM writes the alternative firmware search path to sysfs right before loading the out-of-tree drivers. The second is removing the in-tree intel_vsec driver on the fly, which is required before loading its out-of-tree equivalent. Previously, both of these operations required a machine configuration change, which triggered unnecessary reboots.
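The sketch below illustrates how these two operations can be expressed in a KMM Module resource. It is a minimal example, not the project's actual manifest: the module name, namespace, driver container image, and node selector label are placeholders, and the exact field names (such as modprobe.firmwarePath and inTreeModuleToRemove) may vary between KMM versions.

```yaml
apiVersion: kmm.sigs.x-k8s.io/v1beta1
kind: Module
metadata:
  name: intel-dgpu            # placeholder module name
  namespace: openshift-kmm
spec:
  moduleLoader:
    container:
      modprobe:
        moduleName: i915       # out-of-tree driver shipped in the driver container
        firmwarePath: /firmware  # firmware in the container image; KMM points the
                                 # kernel's firmware search path here at runtime
      inTreeModuleToRemove: intel_vsec  # unload the in-tree module before loading
                                        # the out-of-tree equivalent
      kernelMappings:
        - regexp: '^.*\.x86_64$'
          # placeholder pre-built driver container image
          containerImage: quay.io/example/intel-dgpu-driver-container:${KERNEL_FULL_VERSION}
  selector:
    intel.feature.node.kubernetes.io/gpu: "true"   # placeholder node label
```

Because both steps are handled by the module loader at runtime, no MachineConfig change, and therefore no reboot, is needed.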
A reboot is a costly operation that adds several minutes to the provisioning process, and in many cases, especially in production on Day 2, a reboot is simply not an option. By performing the configuration changes at runtime, this feature takes GPU provisioning from minutes to seconds without any reboots. It is especially valuable for Single Node OpenShift (SNO) setups, where a reboot means cluster downtime.
The Intel Technology Enabling for OpenShift project provides Intel Data Center hardware feature-provisioning technologies with the Red Hat OpenShift Container Platform (RHOCP). The technologies to deploy and manage Intel Enterprise AI End-to-End (E2E) solutions, along with the related reference workloads for these features, are also included in the project.
Containers that need access to device files usually have to run as root (UID/GID 0/0): when a device plugin makes a device file available to a workload container, the file is owned by root, so the container must run as root to use it. Running containers as root is not good security practice, so it is always preferable to run them rootless. This short tutorial shows how to run the Intel device plugins so that workload containers can run rootless. By default, this mode is not enabled.
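As an illustration, the sketch below shows what a GPU workload pod might look like once the device plugin has been set up for non-root access. The pod image and UID/GID values are placeholders; the resource name gpu.intel.com/i915 follows the Intel GPU device plugin's convention for advertising GPUs.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rootless-gpu-workload
spec:
  securityContext:
    runAsNonRoot: true        # refuse to start if the container resolves to UID 0
    runAsUser: 1000           # arbitrary non-root UID for illustration
    runAsGroup: 1000
  containers:
    - name: workload
      image: quay.io/example/gpu-workload:latest   # placeholder workload image
      resources:
        limits:
          gpu.intel.com/i915: 1   # GPU resource exposed by the Intel GPU device plugin
      securityContext:
        allowPrivilegeEscalation: false
```

With rootless mode enabled for the device plugin, the device files handed to this pod are accessible to the non-root user, so the workload no longer needs UID 0.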