Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix: Allow for optional TimeSlicing configuration #1018

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

danielkleinstein
Copy link

This fixes a vestigial bug from the introduction of MPS. Now that two timeslicing configurations are available, it's possible for MPS to be configured instead of timeslicing.

But since the TimeSlicing field was made non-optional - its existence is forced when unmarshalling, and then the parsing fails because no resources are specified under the empty TimeSlicing field.

Copy link

copy-pr-bot bot commented Oct 27, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@danielkleinstein danielkleinstein force-pushed the bugfix/fix-optional-timeslicing branch from 1c6e3fe to e6ec8d3 Compare October 27, 2024 17:05
api/config/v1/config.go Outdated Show resolved Hide resolved
api/config/v1/sharing.go Outdated Show resolved Hide resolved
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @danielkleinstein.

I have made some suggestions for minor cleanups. I understand that the way this was done probably mirrors the same checks done for MPS, so I'm also ok to merge this as-is and then drop the redundant == nil checks in a follow-up (or as a second commit on top of the one here).

@danielkleinstein
Copy link
Author

@elezar Thanks for the quick review 🙂 I removed the unnecessary nil checks you commented on in a separate commit.

@danielkleinstein danielkleinstein force-pushed the bugfix/fix-optional-timeslicing branch from f0505c9 to d1bb887 Compare October 28, 2024 13:49
@elezar
Copy link
Member

elezar commented Oct 28, 2024

@danielkleinstein the email address on your second commit is different to the first. Please update.

@danielkleinstein danielkleinstein force-pushed the bugfix/fix-optional-timeslicing branch 2 times, most recently from db0e5d8 to 1e8822c Compare October 28, 2024 13:52
@danielkleinstein
Copy link
Author

@elezar Fixed

@danielkleinstein danielkleinstein force-pushed the bugfix/fix-optional-timeslicing branch 3 times, most recently from eadb4e2 to 2c80a56 Compare October 28, 2024 13:56
@elezar elezar enabled auto-merge October 28, 2024 14:09
auto-merge was automatically disabled October 28, 2024 14:19

Head branch was pushed to by a user without write access

@danielkleinstein danielkleinstein force-pushed the bugfix/fix-optional-timeslicing branch from 2c80a56 to 75eb65b Compare October 28, 2024 14:19
elezar
elezar previously requested changes Oct 28, 2024
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please run make check and / or make test locally if possible:

diff --git a/internal/lm/mig-strategy_test.go b/internal/lm/mig-strategy_test.go
index 005fc961..5aa7b9f6 100644
--- a/internal/lm/mig-strategy_test.go
+++ b/internal/lm/mig-strategy_test.go
@@ -30,7 +30,7 @@ func TestMigStrategyNoneLabels(t *testing.T) {
 	testCases := []struct {
 		description    string
 		devices        []resource.Device
-		timeSlicing    spec.ReplicatedResources
+		timeSlicing    *spec.ReplicatedResources
 		expectedError  bool
 		expectedLabels Labels
 	}{
diff --git a/internal/lm/resource_test.go b/internal/lm/resource_test.go
index 3f180188..3f44763c 100644
--- a/internal/lm/resource_test.go
+++ b/internal/lm/resource_test.go
@@ -271,7 +271,7 @@ func TestMigResourceLabeler(t *testing.T) {
 		description    string
 		resourceName   spec.ResourceName
 		count          int
-		timeSlicing    spec.ReplicatedResources
+		timeSlicing    *spec.ReplicatedResources
 		expectedLabels Labels
 	}{
 		{
@@ -301,7 +301,7 @@ func TestMigResourceLabeler(t *testing.T) {
 			description:  "shared appends suffix and doubles count",
 			resourceName: "nvidia.com/gpu",
 			count:        1,
-			timeSlicing: spec.ReplicatedResources{
+			timeSlicing: &spec.ReplicatedResources{
 				Resources: []spec.ReplicatedResource{
 					{
 						Name:     "nvidia.com/gpu",
@@ -329,7 +329,7 @@ func TestMigResourceLabeler(t *testing.T) {
 			description:  "renamed does not append suffix and doubles count",
 			resourceName: "nvidia.com/gpu",
 			count:        1,
-			timeSlicing: spec.ReplicatedResources{
+			timeSlicing: &spec.ReplicatedResources{
 				Resources: []spec.ReplicatedResource{
 					{
 						Name:     "nvidia.com/gpu",
@@ -358,7 +358,7 @@ func TestMigResourceLabeler(t *testing.T) {
 			description:  "mig mixed appends shared",
 			resourceName: "nvidia.com/mig-1g.1gb",
 			count:        1,
-			timeSlicing: spec.ReplicatedResources{
+			timeSlicing: &spec.ReplicatedResources{
 				Resources: []spec.ReplicatedResource{
 					{
 						Name:     "nvidia.com/gpu",
@@ -391,7 +391,7 @@ func TestMigResourceLabeler(t *testing.T) {
 			description:  "mig mixed rename does not append",
 			resourceName: "nvidia.com/mig-1g.1gb",
 			count:        1,
-			timeSlicing: spec.ReplicatedResources{
+			timeSlicing: &spec.ReplicatedResources{
 				Resources: []spec.ReplicatedResource{
 					{
 						Name:     "nvidia.com/mig-1g.1gb",

@elezar elezar dismissed their stale review October 28, 2024 14:22

Changes applied.

@elezar elezar enabled auto-merge October 28, 2024 14:22
@danielkleinstein
Copy link
Author

danielkleinstein commented Oct 28, 2024

@elezar Sorry, was kind of hoping to use the project's CI to run the tests 👻 I was working from a temporary Github Codespaces and make test didn't run so well. I'll clone locally and make sure tests pass before I push again.

@elezar
Copy link
Member

elezar commented Oct 28, 2024

@elezar Sorry, was kind of hoping to use the project's CI to run the tests 👻 I was working from a temporary Github Codespaces and make test didn't run so well. I'll clone locally and make sure tests pass before I push again.

We require an opt-in to run tests. (Still working on improving the configurations here).

diff --git a/internal/lm/resource_test.go b/internal/lm/resource_test.go
index 3f180188..6a3d7ae7 100644
--- a/internal/lm/resource_test.go
+++ b/internal/lm/resource_test.go
@@ -422,7 +422,7 @@ func TestMigResourceLabeler(t *testing.T) {
 		t.Run(tc.description, func(t *testing.T) {
 			config := &spec.Config{
 				Sharing: spec.Sharing{
-					TimeSlicing: tc.timeSlicing,
+					TimeSlicing: &tc.timeSlicing,
 				},
 			}
 			l, err := NewMIGResourceLabeler(tc.resourceName, config, device, tc.count)

auto-merge was automatically disabled October 28, 2024 14:37

Head branch was pushed to by a user without write access

@danielkleinstein danielkleinstein force-pushed the bugfix/fix-optional-timeslicing branch 2 times, most recently from 39063d1 to eb95c30 Compare October 28, 2024 14:38
danielkleinstein and others added 3 commits October 28, 2024 16:39
This fixes a vestigial bug from the introduction of MPS. Now that two timeslicing configurations are available, it's possible for MPS to be configured instead of timeslicing.

But since the TimeSlicing field was made non-optional - its existence is forced when unmarshalling, and then the parsing fails because no resources are specified under the empty TimeSlicing field.

Signed-off-by: Daniel Kleinstein <daniel.kleinstein@gmail.com>
Signed-off-by: Daniel Kleinstein <daniel.kleinstein@gmail.com>
…y is defined

Signed-off-by: Daniel Kleinstein <daniel@scaleops.com>
@danielkleinstein danielkleinstein force-pushed the bugfix/fix-optional-timeslicing branch from eb95c30 to 72b9e0b Compare October 28, 2024 14:39
@danielkleinstein
Copy link
Author

danielkleinstein commented Oct 28, 2024

@elezar I need some guidance - in a separate commit, I changed sharingDisabled from:

func (rl resourceLabeler) sharingDisabled() bool {
	return rl.sharing == nil
}

To:

func (rl resourceLabeler) sharingDisabled() bool {
    return rl.sharing == nil || (rl.sharing.SharingStrategy() == spec.SharingStrategyNone)
}

Which was necessary to get TestRunOneshot to pass. But this causes TestMigResourceLabeler to fail, because in the first test case it expects nvidia.com/gpu.replicas: 1, but it gets nvidia.com/gpu.replicas: 0, because of this code:

func (rl resourceLabeler) getReplicas() int {
	if rl.sharingDisabled() {
		return 0
	} else if r := rl.replicationInfo(); r != nil && r.Replicas > 0 {
		return r.Replicas
	}
	return 1
}

What's strange to me is that in the first test case, sharing seems to indeed be disabled (because nvidia.com/gpu.sharing-strategy: none) - so it seems like the number of replicas really should be 0 - but the expected label is 1?


An alternative fix for TestRunOneshot - which wouldn't trigger a failure in TestMigResourceLabeler, is to change replicationInfo from:

func (rl resourceLabeler) replicationInfo() *spec.ReplicatedResource {
	if rl.sharingDisabled() {
		return nil
	}
	for _, r := range rl.sharing.ReplicatedResources().Resources {
		if r.Name == rl.resourceName {
			return &r
		}
	}
	return nil
}

To:

func (rl resourceLabeler) replicationInfo() *spec.ReplicatedResource {
	if rl.sharingDisabled() {
		return nil
	}

	rr := rl.sharing.ReplicatedResources()
	if rr == nil {
		return nil
	}

	for _, r := range rr.Resources {
		if r.Name == rl.resourceName {
			return &r
		}
	}
	return nil
}

But I'm not sure if this is a desirable change.

@elezar elezar self-assigned this Oct 28, 2024
@elezar
Copy link
Member

elezar commented Nov 20, 2024

In the existing implementation, the sharing == nil code path should only be triggered when calling NewGPUResourceLabelerWithoutSharing. This explicitly sets the replicas label to 0 and does not add other sharing labels. It has been on my TODO list for a while to rework this to be clearer, but we have not had time to focus on it.

What about removing the last commit that attempts to address this and making the following change:

diff --git a/api/config/v1/sharing.go b/api/config/v1/sharing.go
index ab768112..0c35cf30 100644
--- a/api/config/v1/sharing.go
+++ b/api/config/v1/sharing.go
@@ -34,10 +34,9 @@ const (
 
 // SharingStrategy returns the active sharing strategy.
 func (s *Sharing) SharingStrategy() SharingStrategy {
-	if s.MPS != nil && s.MPS.isReplicated() {
+	if s.MPS.isReplicated() {
 		return SharingStrategyMPS
 	}
-
 	if s.TimeSlicing.isReplicated() {
 		return SharingStrategyTimeSlicing
 	}
@@ -46,8 +45,16 @@ func (s *Sharing) SharingStrategy() SharingStrategy {
 
 // ReplicatedResources returns the resources associated with the active sharing strategy.
 func (s *Sharing) ReplicatedResources() *ReplicatedResources {
-	if s.MPS != nil {
+	none := &ReplicatedResources{}
+	if s == nil {
+		return none
+	}
+	switch s.SharingStrategy() {
+	case SharingStrategyMPS:
 		return s.MPS
+	case SharingStrategyTimeSlicing:
+		return s.TimeSlicing
+	default:
+		return none
 	}
-	return s.TimeSlicing
 }

This ensures that Sharing.ReplicatedResources() always returns a valid pointer and that an empty set of resources are returned when required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants