
Defaulting before validation causes operator panic #1628

Closed
henrysecond1 opened this issue Jul 1, 2022 · 2 comments

Comments

@henrysecond1
Contributor

henrysecond1 commented Jul 1, 2022

Reproduce

Deploy the PyTorchJob below:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: invalid-pytorchjob
spec:
  pytorchReplicaSpecs:
    Worker: {}

The operator fails with the following log:

E0701 05:54:38.872167       1 runtime.go:79] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 718 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x19dce20, 0xc000557e48})
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x4})
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/runtime/runtime.go:49 +0x75
panic({0x19dce20, 0xc000557e48})
        /usr/local/go/src/runtime/panic.go:1038 +0x215
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.hasDefaultPort(0xc0003c3790, 0x1adf8fb, {0x1ae97e9, 0x0})
        /workspace/pkg/apis/kubeflow.org/v1/defaulting_utils.go:21 +0x10c
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.setPytorchDefaultPort(0xc00014aa00)
        /workspace/pkg/apis/kubeflow.org/v1/pytorch_defaults.go:31 +0x65
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.SetDefaults_PyTorchJob(0xc000216888)
        /workspace/pkg/apis/kubeflow.org/v1/pytorch_defaults.go:80 +0xdb
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.SetObjectDefaults_PyTorchJob(...)
        /workspace/pkg/apis/kubeflow.org/v1/zz_generated.defaults.go:66
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.RegisterDefaults.func5({0x1aa0aa0, 0xc000216888})
        /workspace/pkg/apis/kubeflow.org/v1/zz_generated.defaults.go:34 +0x32
k8s.io/apimachinery/pkg/runtime.(*Scheme).Default(0xc000c43c78, {0x1d34cb0, 0xc000216888})
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/runtime/scheme.go:346 +0xa4
github.com/kubeflow/training-operator/pkg/controller.v1/pytorch.(*PyTorchJobReconciler).onOwnerCreateFunc.func1({{0x1da38a0, 0xc000216888}})
        /workspace/pkg/controller.v1/pytorch/pytorchjob_controller.go:530 +0x6a
sigs.k8s.io/controller-runtime/pkg/predicate.Funcs.Create(...)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/predicate/predicate.go:72
sigs.k8s.io/controller-runtime/pkg/source/internal.EventHandler.OnAdd({{0x1d4e268, 0x2b9b6e0}, {0x1d8af68, 0xc00076a1e0}, {0xc00078f410, 0x1, 0x1}}, {0x1aa0aa0, 0xc000216888})
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/source/internal/eventsource.go:57 +0x28b
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
        /go/pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/shared_informer.go:818 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f3954bd4b88)
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000710738, {0x1d1a8a0, 0xc0003990e0}, 0x1, 0xc000560c60)
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00005cbd0, 0x3b9aca00, 0x0, 0x0, 0x1259b80)
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc000c82180)
        /go/pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/shared_informer.go:812 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:71 +0x88
panic: runtime error: index out of range [0] with length 0 [recovered]
        panic: runtime error: index out of range [0] with length 0

goroutine 718 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x4})
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/runtime/runtime.go:56 +0xd8
panic({0x19dce20, 0xc000557e48})
        /usr/local/go/src/runtime/panic.go:1038 +0x215
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.hasDefaultPort(0xc0003c3790, 0x1adf8fb, {0x1ae97e9, 0x0})
        /workspace/pkg/apis/kubeflow.org/v1/defaulting_utils.go:21 +0x10c
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.setPytorchDefaultPort(0xc00014aa00)
        /workspace/pkg/apis/kubeflow.org/v1/pytorch_defaults.go:31 +0x65
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.SetDefaults_PyTorchJob(0xc000216888)
        /workspace/pkg/apis/kubeflow.org/v1/pytorch_defaults.go:80 +0xdb
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.SetObjectDefaults_PyTorchJob(...)
        /workspace/pkg/apis/kubeflow.org/v1/zz_generated.defaults.go:66
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.RegisterDefaults.func5({0x1aa0aa0, 0xc000216888})
        /workspace/pkg/apis/kubeflow.org/v1/zz_generated.defaults.go:34 +0x32
k8s.io/apimachinery/pkg/runtime.(*Scheme).Default(0xc000c43c78, {0x1d34cb0, 0xc000216888})
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/runtime/scheme.go:346 +0xa4
github.com/kubeflow/training-operator/pkg/controller.v1/pytorch.(*PyTorchJobReconciler).onOwnerCreateFunc.func1({{0x1da38a0, 0xc000216888}})
        /workspace/pkg/controller.v1/pytorch/pytorchjob_controller.go:530 +0x6a
sigs.k8s.io/controller-runtime/pkg/predicate.Funcs.Create(...)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/predicate/predicate.go:72
sigs.k8s.io/controller-runtime/pkg/source/internal.EventHandler.OnAdd({{0x1d4e268, 0x2b9b6e0}, {0x1d8af68, 0xc00076a1e0}, {0xc00078f410, 0x1, 0x1}}, {0x1aa0aa0, 0xc000216888})
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/source/internal/eventsource.go:57 +0x28b
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
        /go/pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/shared_informer.go:818 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f3954bd4b88)
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000710738, {0x1d1a8a0, 0xc0003990e0}, 0x1, 0xc000560c60)
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00005cbd0, 0x3b9aca00, 0x0, 0x0, 0x1259b80)
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc000c82180)
        /go/pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/shared_informer.go:812 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
        /go/pkg/mod/k8s.io/apimachinery@v0.24.1/pkg/util/wait/wait.go:71 +0x88

Analysis

Default() is called before validation in onOwnerCreateFunc.

The defaulting code assumes each replica spec has at least one container, so it panics with an index-out-of-range error when the container list is empty.
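The failure mode can be reproduced in isolation. This is a minimal sketch with simplified stand-in types (Container, ReplicaSpec, and the helper names here are hypothetical, not the operator's actual definitions); it mirrors the pattern of indexing Containers[0] without a length check, which is what hasDefaultPort hits when the spec is Worker: {}:

```go
package main

import "fmt"

// Hypothetical stand-ins for the operator's replica spec types.
type Container struct {
	Name  string
	Ports []int32
}

type ReplicaSpec struct {
	Containers []Container
}

// setDefaultPort mirrors the buggy pattern: it indexes Containers[0]
// without first checking that the slice is non-empty.
func setDefaultPort(spec *ReplicaSpec) {
	c := &spec.Containers[0] // panics when Containers is empty
	c.Ports = append(c.Ports, 23456)
}

// tryDefault recovers the panic and returns its message, so the
// failure can be observed without crashing the process.
func tryDefault(spec *ReplicaSpec) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	setDefaultPort(spec)
	return ""
}

func main() {
	// Empty Containers slice, as produced by `Worker: {}` in the repro.
	fmt.Println(tryDefault(&ReplicaSpec{}))
	// → runtime error: index out of range [0] with length 0
}
```

The recovered message matches the one in the stack trace above, which is why the panic surfaces inside HandleCrash rather than as a validation error.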

Possible solution

Default() is already called after validation in every controller's Reconcile function, so the Default() call in onOwnerCreateFunc can be removed (uploaded PR #1629 for this).
Also, as #1564 pointed out, the controller should return immediately when validation fails.
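Independently of removing the call, a defensive variant of the helper would also avoid the crash. This is only a sketch of the guard-before-index idea, not the fix in #1629 (the function name and error text here are made up); it turns the empty replica spec into an error the caller can surface:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the operator's replica spec types.
type Container struct {
	Name  string
	Ports []int32
}

type ReplicaSpec struct {
	Containers []Container
}

// setDefaultPortSafe guards the index so an empty replica spec is
// reported as an error instead of panicking the operator.
func setDefaultPortSafe(spec *ReplicaSpec) error {
	if len(spec.Containers) == 0 {
		return errors.New("replica spec has no containers")
	}
	c := &spec.Containers[0]
	c.Ports = append(c.Ports, 23456)
	return nil
}

func main() {
	// Same empty spec as the repro, but now it fails gracefully.
	err := setDefaultPortSafe(&ReplicaSpec{})
	fmt.Println(err)
	// → replica spec has no containers
}
```

Either way, the underlying point stands: defaulting must not run on an object that has not passed validation, or every defaulting helper must tolerate arbitrarily incomplete specs.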

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
