Skip to content

Conversation

@tariq1890
Copy link
Contributor

No description provided.

@tariq1890 tariq1890 force-pushed the nri-plugin-server branch 12 times, most recently from 933aa39 to 446d2f0 Compare November 22, 2025 19:07
@tariq1890 tariq1890 force-pushed the nri-plugin-server branch 11 times, most recently from 94a00c8 to cbd8cf9 Compare December 4, 2025 07:20
@tariq1890 tariq1890 marked this pull request as ready for review December 4, 2025 07:44
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
// nodeResourceCDIDeviceKey is the prefix of the key used for CDI device annotations.
nodeResourceCDIDeviceKey = "cdi-devices.noderesource.dev"
// nriCDIDeviceKey is the prefix of the key used for CDI device annotations.
nriCDIDeviceKey = "cdi-devices.nri.io"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you comment on why we need two prefixes here? Are these not determined by US? (https://github.com/NVIDIA/gpu-operator/pull/1950/files#diff-e6f52ba1392796db4c79e078d3f1067c50e3bfde9d90f3aaaad3eb3e3f4d84fbR20-R21). Why not only respond to one of them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't. Yes, these should be defined by us and one annotation should be sufficient. This is taken the from example plugin code of containerd/nri

Comment on lines +94 to +101
for _, key := range []string{
mainKey + "/container." + ctr,
oldKey + "/container." + ctr,
mainKey + "/pod",
oldKey + "/pod",
mainKey,
oldKey,
} {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having to deal with two keys, could we rather have a single function that we call for each of the keys?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plan is to definitely deal with just one annotation key for now. This was taken from the example plugin code

Comment on lines +128 to +130
var pluginOpts []nriplugin.Option
pluginOpts = append(pluginOpts, nriplugin.WithPluginIdx(nriPluginIdx))
pluginOpts = append(pluginOpts, nriplugin.WithSocketPath(nriSocketPath))
Copy link
Member

@elezar elezar Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var pluginOpts []nriplugin.Option
pluginOpts = append(pluginOpts, nriplugin.WithPluginIdx(nriPluginIdx))
pluginOpts = append(pluginOpts, nriplugin.WithSocketPath(nriSocketPath))
pluginOpts := []nriplugin.Option{
nriplugin.WithPluginIdx(nriPluginIdx),
nriplugin.WithSocketPath(nriSocketPath),
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

Comment on lines +149 to +153
func (c *ConfigV1) EnableNRI() {
config := *c.Tree
config.SetPath([]string{"plugins", nriPluginName, "disable"}, false)
*c.Tree = config
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (c *ConfigV1) EnableNRI() {
config := *c.Tree
config.SetPath([]string{"plugins", nriPluginName, "disable"}, false)
*c.Tree = config
}
func (c *ConfigV1) EnableNRI() {
(*Config)(c).EnableNRI()
}

const (
defaultConfigVersion = 2
defaultRuntimeType = "io.containerd.runc.v2"
nriPluginName = "io.containerd.nri.v1.nri"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we we only have a single implementation for for EnableNRI, then we can just use this string there directly instead of using a constant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer having a constant though, it makes it a bit more readable. Not a stong opinion however


"github.com/containerd/nri/pkg/api"
nriplugin "github.com/containerd/nri/pkg/stub"
"sigs.k8s.io/yaml"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to use a comma-separated list of devices instead of having to parse the YAML?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I am open to that


for _, name := range devices {
a.AddCDIDevice(
&api.CDIDevice{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I am aware, this introduces restructions on compatible containerd / cri-o versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we've moved to native CDI, the minimum supported containerd is now v1.7. Will that be affected by this change?

logger logger.Interface

toolkit *toolkit.Installer
nriPlugin *nri.Plugin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to embed this here? Does it make sense to just instantiate it when required and use a deferred shutdown?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like:

diff --git a/cmd/nvidia-ctk-installer/main.go b/cmd/nvidia-ctk-installer/main.go
index 39ee1abe..c725ee3e 100644
--- a/cmd/nvidia-ctk-installer/main.go
+++ b/cmd/nvidia-ctk-installer/main.go
@@ -75,15 +75,13 @@ func main() {
 type app struct {
 	logger logger.Interface
 
-	nriPlugin *nri.Plugin
-	toolkit   *toolkit.Installer
+	toolkit *toolkit.Installer
 }
 
 // NewApp creates the CLI app fro the specified options.
 func NewApp(logger logger.Interface) *cli.Command {
 	a := app{
-		logger:    logger,
-		nriPlugin: nri.NewPlugin(logger),
+		logger: logger,
 	}
 	return a.build()
 }
@@ -229,11 +227,12 @@ func (a *app) Run(ctx context.Context, c *cli.Command, o *options) error {
 	}
 
 	if !o.noDaemon {
-		if o.runtimeOptions.EnableNRI {
-			if err = a.startNRIPluginServer(ctx, o.runtimeOptions); err != nil {
-				a.logger.Errorf("unable to start NRI plugin server: %v", err)
-			}
+		nriPlugin, err := a.startNRIPluginServer(ctx, o.runtimeOptions)
+		if err != nil {
+			a.logger.Errorf("unable to start NRI plugin server: %v", err)
 		}
+		defer nriPlugin.Stop()
+
 		err = a.waitForSignal()
 		if err != nil {
 			return fmt.Errorf("unable to wait for signal: %v", err)
@@ -299,11 +298,15 @@ func (a *app) waitForSignal() error {
 	return nil
 }
 
-func (a *app) startNRIPluginServer(ctx context.Context, opts runtime.Options) error {
+func (a *app) startNRIPluginServer(ctx context.Context, opts runtime.Options) (*nri.Plugin, error) {
+	if !opts.EnableNRI {
+		return nil, nil
+	}
 	a.logger.Infof("Starting the NRI Plugin server....")
 
+	plugin := nri.NewPlugin(a.logger)
 	retriable := func() error {
-		return a.nriPlugin.Start(ctx, opts.NRISocket, opts.NRIPluginIndex)
+		return plugin.Start(ctx, opts.NRISocket, opts.NRIPluginIndex)
 	}
 	var err error
 	for i := 0; i < maxRetryAttempts; i++ {
@@ -318,19 +321,13 @@ func (a *app) startNRIPluginServer(ctx context.Context, opts runtime.Options) er
 	}
 	if err != nil {
 		a.logger.Errorf("Max retries reached %d/%d, aborting", maxRetryAttempts, maxRetryAttempts)
-		return err
+		return nil, err
 	}
-	return nil
+	return plugin, nil
 }
 
 func (a *app) shutdown(pidFile string) {
 	a.logger.Infof("Shutting Down")
-
-	if a.nriPlugin != nil {
-		a.logger.Infof("Stopping NRI plugin server...")
-		a.nriPlugin.Stop()
-	}
-
 	err := os.Remove(pidFile)
 	if err != nil {
 		a.logger.Warningf("Unable to remove pidfile: %v", err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this is much better. Thanks for the suggestion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants