-
Notifications
You must be signed in to change notification settings - Fork 445
implement NRI plugin server to inject management CDI devices #1498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
933aa39 to
446d2f0
Compare
94a00c8 to
cbd8cf9
Compare
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
cbd8cf9 to
96b61e9
Compare
| // nodeResourceCDIDeviceKey is the prefix of the key used for CDI device annotations. | ||
| nodeResourceCDIDeviceKey = "cdi-devices.noderesource.dev" | ||
| // nriCDIDeviceKey is the prefix of the key used for CDI device annotations. | ||
| nriCDIDeviceKey = "cdi-devices.nri.io" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you comment on why we need two prefixes here? Are these not determined by US? (https://github.com/NVIDIA/gpu-operator/pull/1950/files#diff-e6f52ba1392796db4c79e078d3f1067c50e3bfde9d90f3aaaad3eb3e3f4d84fbR20-R21). Why not only respond to one of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't. Yes, these should be defined by us and one annotation should be sufficient. This is taken the from example plugin code of containerd/nri
| for _, key := range []string{ | ||
| mainKey + "/container." + ctr, | ||
| oldKey + "/container." + ctr, | ||
| mainKey + "/pod", | ||
| oldKey + "/pod", | ||
| mainKey, | ||
| oldKey, | ||
| } { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having to deal with two keys, could we rather have a single function that we call for each of the keys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plan is to definitely deal with just one annotation key for now. This was taken from the example plugin code
| var pluginOpts []nriplugin.Option | ||
| pluginOpts = append(pluginOpts, nriplugin.WithPluginIdx(nriPluginIdx)) | ||
| pluginOpts = append(pluginOpts, nriplugin.WithSocketPath(nriSocketPath)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| var pluginOpts []nriplugin.Option | |
| pluginOpts = append(pluginOpts, nriplugin.WithPluginIdx(nriPluginIdx)) | |
| pluginOpts = append(pluginOpts, nriplugin.WithSocketPath(nriSocketPath)) | |
| pluginOpts := []nriplugin.Option{ | |
| nriplugin.WithPluginIdx(nriPluginIdx), | |
| nriplugin.WithSocketPath(nriSocketPath), | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks
| func (c *ConfigV1) EnableNRI() { | ||
| config := *c.Tree | ||
| config.SetPath([]string{"plugins", nriPluginName, "disable"}, false) | ||
| *c.Tree = config | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func (c *ConfigV1) EnableNRI() { | |
| config := *c.Tree | |
| config.SetPath([]string{"plugins", nriPluginName, "disable"}, false) | |
| *c.Tree = config | |
| } | |
| func (c *ConfigV1) EnableNRI() { | |
| (*Config)(c).EnableNRI() | |
| } |
| const ( | ||
| defaultConfigVersion = 2 | ||
| defaultRuntimeType = "io.containerd.runc.v2" | ||
| nriPluginName = "io.containerd.nri.v1.nri" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we we only have a single implementation for for EnableNRI, then we can just use this string there directly instead of using a constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer having a constant though, it makes it a bit more readable. Not a stong opinion however
|
|
||
| "github.com/containerd/nri/pkg/api" | ||
| nriplugin "github.com/containerd/nri/pkg/stub" | ||
| "sigs.k8s.io/yaml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to use a comma-separated list of devices instead of having to parse the YAML?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I am open to that
|
|
||
| for _, name := range devices { | ||
| a.AddCDIDevice( | ||
| &api.CDIDevice{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I am aware, this introduces restructions on compatible containerd / cri-o versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we've moved to native CDI, the minimum supported containerd is now v1.7. Will that be affected by this change?
| logger logger.Interface | ||
|
|
||
| toolkit *toolkit.Installer | ||
| nriPlugin *nri.Plugin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to embed this here? Does it make sense to just instantiate it when required and use a deferred shutdown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like:
diff --git a/cmd/nvidia-ctk-installer/main.go b/cmd/nvidia-ctk-installer/main.go
index 39ee1abe..c725ee3e 100644
--- a/cmd/nvidia-ctk-installer/main.go
+++ b/cmd/nvidia-ctk-installer/main.go
@@ -75,15 +75,13 @@ func main() {
type app struct {
logger logger.Interface
- nriPlugin *nri.Plugin
- toolkit *toolkit.Installer
+ toolkit *toolkit.Installer
}
// NewApp creates the CLI app fro the specified options.
func NewApp(logger logger.Interface) *cli.Command {
a := app{
- logger: logger,
- nriPlugin: nri.NewPlugin(logger),
+ logger: logger,
}
return a.build()
}
@@ -229,11 +227,12 @@ func (a *app) Run(ctx context.Context, c *cli.Command, o *options) error {
}
if !o.noDaemon {
- if o.runtimeOptions.EnableNRI {
- if err = a.startNRIPluginServer(ctx, o.runtimeOptions); err != nil {
- a.logger.Errorf("unable to start NRI plugin server: %v", err)
- }
+ nriPlugin, err := a.startNRIPluginServer(ctx, o.runtimeOptions)
+ if err != nil {
+ a.logger.Errorf("unable to start NRI plugin server: %v", err)
}
+ defer nriPlugin.Stop()
+
err = a.waitForSignal()
if err != nil {
return fmt.Errorf("unable to wait for signal: %v", err)
@@ -299,11 +298,15 @@ func (a *app) waitForSignal() error {
return nil
}
-func (a *app) startNRIPluginServer(ctx context.Context, opts runtime.Options) error {
+func (a *app) startNRIPluginServer(ctx context.Context, opts runtime.Options) (*nri.Plugin, error) {
+ if !opts.EnableNRI {
+ return nil, nil
+ }
a.logger.Infof("Starting the NRI Plugin server....")
+ plugin := nri.NewPlugin(a.logger)
retriable := func() error {
- return a.nriPlugin.Start(ctx, opts.NRISocket, opts.NRIPluginIndex)
+ return plugin.Start(ctx, opts.NRISocket, opts.NRIPluginIndex)
}
var err error
for i := 0; i < maxRetryAttempts; i++ {
@@ -318,19 +321,13 @@ func (a *app) startNRIPluginServer(ctx context.Context, opts runtime.Options) er
}
if err != nil {
a.logger.Errorf("Max retries reached %d/%d, aborting", maxRetryAttempts, maxRetryAttempts)
- return err
+ return nil, err
}
- return nil
+ return plugin, nil
}
func (a *app) shutdown(pidFile string) {
a.logger.Infof("Shutting Down")
-
- if a.nriPlugin != nil {
- a.logger.Infof("Stopping NRI plugin server...")
- a.nriPlugin.Stop()
- }
-
err := os.Remove(pidFile)
if err != nil {
a.logger.Warningf("Unable to remove pidfile: %v", err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this is much better. Thanks for the suggestion!
No description provided.