Skip to content

Wrap all opaque types as interfaces#112

Merged
klueska merged 21 commits intoNVIDIA:mainfrom
klueska:wrap-type-as-interfaces
Apr 12, 2024
Merged

Wrap all opaque types as interfaces#112
klueska merged 21 commits intoNVIDIA:mainfrom
klueska:wrap-type-as-interfaces

Conversation

@klueska
Copy link
Collaborator

@klueska klueska commented Apr 10, 2024

No description provided.

@klueska klueska requested a review from elezar April 10, 2024 12:17
@klueska klueska force-pushed the wrap-type-as-interfaces branch from f85ef81 to 7e28d72 Compare April 10, 2024 12:21
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial change look good. The main question I have is whether to top-level library functions that operate ON other types make sense, or whether we should always use the functions off the types?

pkg/nvml/api.go Outdated
GetVgpuDriverCapabilities(Capability VgpuDriverCapability) (bool, Return)
}

// Define package level methods as aliases to Interface methods of libnvml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have toolking to generate this automatically?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at the moment. It was fairly easy to do though. I just grepped for func (l* library) copied that in and did some search and replace on it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tool has been added in the latest revision.

var DeviceGetComputeRunningProcesses = deviceGetComputeRunningProcesses_v1
var DeviceGetGraphicsRunningProcesses = deviceGetGraphicsRunningProcesses_v1
var DeviceGetMPSComputeRunningProcesses = deviceGetMPSComputeRunningProcesses_v1
var deviceGetComputeRunningProcesses = deviceGetComputeRunningProcesses_v1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a silly question: Can you remind me why we don't have an nvml prefix here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nvml "functions" in this block come from #defines in nvml.h. The underlying versioned functions that get assigned to these variables are generated as part of the bindings with the v1, v2, v3, etc. prefixes and then assigned in updateVersionedSymbols below.

The difference with these functions is that we need to wrap the underlying nvml*_v1, nvml*_v2, etc. functions with some extra, version specific logic. Because of this, we create a level of indirection before assinging them back out to the "unversioned" variable.

pkg/nvml/api.go Outdated
DeviceGetVgpuSchedulerCapabilities(Device Device) (VgpuSchedulerCapabilities, Return)
GpuInstanceGetComputeInstancePossiblePlacements(GpuInstance GpuInstance, Info *ComputeInstanceProfileInfo) ([]ComputeInstancePlacement, Return)
GpuInstanceCreateComputeInstanceWithPlacement(GpuInstance GpuInstance, Info *ComputeInstanceProfileInfo, Placement *ComputeInstancePlacement, ComputeInstance *ComputeInstance) Return
GpuInstanceCreateComputeInstanceWithPlacement(GpuInstance GpuInstance, Info *ComputeInstanceProfileInfo, Placement *ComputeInstancePlacement) (ComputeInstance, Return)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a bug? Does it make sense to do this in a separate PR so as to not have it go "missing" in the changelog?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is a bug. However, if I do it as a separate PR, then I can't make the updates in the commit that follows to turn ComputeInstance into an Interface. I'd rather have all the interface conversion in a single PR. I could look at maybe updating it in a "pre" PR, but I don't know how hard it will be to rebase if I do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to leave this in this PR as long as we remember to call it out in the release notes once we tag a new version.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make sure and call it out in the release notes

@klueska klueska force-pushed the wrap-type-as-interfaces branch from 7e28d72 to 81e362c Compare April 10, 2024 13:58
@klueska klueska changed the title Draft: Wrap all opaque types as interfaces Wrap all opaque types as interfaces Apr 10, 2024
@klueska
Copy link
Collaborator Author

klueska commented Apr 11, 2024

Initial change look good. The main question I have is whether to top-level library functions that operate ON other types make sense, or whether we should always use the functions off the types?

I'd be fine returning concrete types for everything (including the top level library) and moving the API definitions to the testing package. I went back and forth on this and don't really have a preference, so long as we have a way to pull in a mock version of the library (and its other types) from somewhere.

@elezar
Copy link
Member

elezar commented Apr 11, 2024

Initial change look good. The main question I have is whether to top-level library functions that operate ON other types make sense, or whether we should always use the functions off the types?

No, that's not what I meant -- although it is a valid question. I meant that we have, as an example, three functions that do the same thing:

DeviceGetComputeMode(device Device)  (ComputeMode, Return)
library.DeviceGetComputeMode(device Device) (ComputeMode, Return)
Device.GetComputeMode()  (ComputeMode, Return)

My question was whether we would consider moving towards requiring the use of Device.GetComputeMode() and at least deprecate the other usages? This would drastically reduce the size of the top-level Interface.

@klueska
Copy link
Collaborator Author

klueska commented Apr 11, 2024

Initial change look good. The main question I have is whether to top-level library functions that operate ON other types make sense, or whether we should always use the functions off the types?

No, that's not what I meant -- although it is a valid question. I meant that we have, as an example, three functions that do the same thing:

DeviceGetComputeMode(device Device)  (ComputeMode, Return)
library.DeviceGetComputeMode(device Device) (ComputeMode, Return)
Device.GetComputeMode()  (ComputeMode, Return)

My question was whether we would consider moving towards requiring the use of Device.GetComputeMode() and at least deprecate the other usages? This would drastically reduce the size of the top-level Interface.

I would be inclined to keep them since that is what matches the actual NVML API. That said, we should add tooling to autogenerate the top level Interface as well as the package level aliases from the functions hanging off the library type.

@klueska klueska force-pushed the wrap-type-as-interfaces branch 3 times, most recently from 8cfa643 to c4bbc89 Compare April 12, 2024 09:40
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments on the generation toolking to start with.


func getWriter(outputFile string) (io.Writer, func() error, error) {
if outputFile == "-" {
return os.Stdout, nil, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of nil as the "Closer" can we return:

func() error { return nil }

then we don't need to check for nil at the close site?

Alternatively, we can return a io.WriteCloser and implement a noopCloser that we combine with os.Stdout:

type noopCloser struct {
   io.Writer
}

func (n *noopCloser) Close() error {
   return nil
}

And then our return statement becomes:

return &noopCloser{Writer: os.Stdout}, nil

(and we can just return os.Create(outputFile) directly in the file case).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with the first suggestion

Copy link
Collaborator Author

@klueska klueska Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you know, I am not a fan of "noop code" and prefer to be explicit in the caller (which is why I hesitated even changing this at all), but this is minor, so its fine.

return nil
})
if err != nil {
return nil, fmt.Errorf("walking the embed.FS path: %w\n", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

embed.FS?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover from an old iteration, removing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@klueska klueska force-pushed the wrap-type-as-interfaces branch 3 times, most recently from 217cd99 to b02ebca Compare April 12, 2024 10:23
elezar
elezar previously approved these changes Apr 12, 2024
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment about where we put our Mock sytems implementations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a thought. Does it make sense to have the "implementations" under something like mock/system or mock/platform?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What value does the extra level of nesting bring? To me it seems better to say mock.DGXA100Server or mock.A100Device over system.A100Device or platform.*.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy with a dgxa100 folder though. Then we could change the names to dgxa100.Server and dgxa100.Device, dgxa100.GpuInstance, etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change -- let me know what you think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What value does the extra level of nesting bring?
It's more about organisation than anything else. The mocks themselves are all generated wherease the systems are implemented.

klueska added 3 commits April 12, 2024 12:40
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
As part of this, redefine all package level methods as methods hanging off of
the 'library' type and create aliases for all package level methods to those
methods hanging off of the default 'libnvml' instance of the 'library' type.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
@klueska klueska force-pushed the wrap-type-as-interfaces branch from 0d26ac8 to 8d282c0 Compare April 12, 2024 12:40
elezar
elezar previously approved these changes Apr 12, 2024
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @klueska.

I have some local changes in the device plugin that could benefit from the "Mock" implementations already!

elezar
elezar previously approved these changes Apr 12, 2024
@klueska klueska force-pushed the wrap-type-as-interfaces branch 2 times, most recently from 7b57959 to abf8e8e Compare April 12, 2024 17:44
}

// Update the errorStringFunc to point to nvml.ErrorString
errorStringFunc = nvmlErrorString
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to set this to the default once we close the library?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Fixed.

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Looks good.

Minor comment regarding the error string.

}
return nvmlErrorString(r)
func (l *library) ErrorString(r Return) string {
return r.Error()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We either need to keep the lookup logic as before or reset the default if the library is unloaded.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't leave it as before because in the r.Error() and r.String() methods I don't have access to l. I will update it reset the function on unload.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

pkg/nvml/lib.go Outdated
l.dl = dl.New(o.path, o.flags)
}

func (l *library) GetExtendedInterface() ExtendedInterface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sold on GetExtendedInterface and would prefer GetExtensions(), but this is not a blocker. I'm happy to keep it as is and revisit this later once we have some usage examples.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as suggested.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I went for a hybrid -- the interface is called ExtendedInterface to be symmetrical with Interface, but the method call is Extensions() (which is the only actual symbol anyone will see in practice), e.g.:

nvml.Extensions().LookupSymbol("symbol")

or

envml := nvml.Extensions()
envml.LookupSymbol("symbol")

Its possible that the type itself might be embedded in a struct, but I only see that happening in testing (if at all) and I like the symmetry of that with Interface for the core API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds fine. I'm not too concerned about the type name, since as you mention that isn't visible at the call site.

klueska added 18 commits April 12, 2024 20:48
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
A new ComputeInstance should be returned, not passed in by reference

Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
A new GpmSample should be returned, not passed in by reference. As part of
this, add methods to hang of of the GpmSample type (which were missing
previously).

Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
We should eventually expand this package to include a unified mock server like
we have in the go-nvlib testing and mig-parted testing.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
This code was pulled over (mostly) directly from 'mig-parted'

Signed-off-by: Kevin Klues <kklues@nvidia.com>
The methods in this interface represent extensions to the core NVML API that
are only accessible through calling GetExtensions() against the Interface in
use (or at the package level for the default interface).

Signed-off-by: Kevin Klues <kklues@nvidia.com>
@klueska klueska force-pushed the wrap-type-as-interfaces branch from abf8e8e to 1fa43fd Compare April 12, 2024 20:55
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now!

@klueska klueska merged commit a97d07c into NVIDIA:main Apr 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants