Failed to create job (job-type: mpi) when using Volcano v0.4.0 #890

Closed
merryzhou opened this issue Jun 30, 2020 · 11 comments
Labels
priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@merryzhou
Contributor

Hi everyone,
I am trying to create an MPI-type job, but it fails, and there are errors in the vc controller log.

volcano version: v0.4.0

job yaml: almost the same as https://github.com/volcano-sh/volcano/tree/master/example/kubecon-2019-china/mpi-sample

error log:

E0630 08:41:42.001113       1 job_controller_actions.go:548] Failed to update status of Job default/lm-mpi-job: Job.batch.volcano.sh "lm-mpi-job" is invalid: status.state.lastTransitionTime: Invalid value: "null": status.state.lastTransitionTime in body must be of type string: "null"
I0630 08:41:42.001195       1 panic.go:679] Finished Job <default/lm-mpi-job> initiate
I0630 08:41:42.001200       1 panic.go:679] Finished Job <default/lm-mpi-job> sync up
E0630 08:41:42.001276       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 431 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x13c30e0, 0x21ba5c0)
	/mnt/go/pkg/mod/k8s.io/apimachinery@v0.16.9-beta.0/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/mnt/go/pkg/mod/k8s.io/apimachinery@v0.16.9-beta.0/pkg/util/runtime/runtime.go:48 +0x82
panic(0x13c30e0, 0x21ba5c0)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
volcano.sh/volcano/pkg/apis/batch/v1alpha1.(*Job).GetObjectKind(0x0, 0x17c7ac0, 0x0)
	<autogenerated>:1 +0x5
k8s.io/client-go/tools/reference.GetReference(0xc0002f5570, 0x176edc0, 0x0, 0x4dbf0b, 0x7f8eda6fc7e0, 0xc000c78918)
	/mnt/go/pkg/mod/k8s.io/client-go@v0.0.0-20191016111102-bec269661e48/tools/reference/ref.go:59 +0x116
k8s.io/client-go/tools/record.(*recorderImpl).generateEvent(0xc00017ff80, 0x176edc0, 0x0, 0x0, 0xbfb6dc9180116d97, 0x76e438a, 0x21d3d20, 0x1572ce5, 0x7, 0x157844f, ...)
	/mnt/go/pkg/mod/k8s.io/client-go@v0.0.0-20191016111102-bec269661e48/tools/record/event.go:291 +0x5d
k8s.io/client-go/tools/record.(*recorderImpl).Event(0xc00017ff80, 0x176edc0, 0x0, 0x1572ce5, 0x7, 0x157844f, 0xe, 0xc00019e1c0, 0xd2)
	/mnt/go/pkg/mod/k8s.io/client-go@v0.0.0-20191016111102-bec269661e48/tools/record/event.go:313 +0xc2
volcano.sh/volcano/pkg/controllers/job.(*Controller).initiateJob(0xc00045e000, 0x0, 0x0, 0x0, 0x0)
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller_actions.go:152 +0x464
volcano.sh/volcano/pkg/controllers/job.(*Controller).syncJob(0xc00045e000, 0xc0009ba060, 0xc00113e470, 0x0, 0x0)
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller_actions.go:195 +0x3d7
volcano.sh/volcano/pkg/controllers/job/state.(*pendingState).Execute(0xc000bcc000, 0x1572c21, 0x7, 0x2c, 0xc00048bd10)
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/state/pending.go:54 +0xbc
volcano.sh/volcano/pkg/controllers/job.(*Controller).processNextReq(0xc00045e000, 0xc000000001, 0x1582200)
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller.go:338 +0x8d8
volcano.sh/volcano/pkg/controllers/job.(*Controller).worker(0xc00045e000, 0xc000000001)
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller.go:260 +0xaa
volcano.sh/volcano/pkg/controllers/job.(*Controller).Run.func1.1()
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller.go:242 +0x31
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0003e07a0)
	/mnt/go/pkg/mod/k8s.io/apimachinery@v0.16.9-beta.0/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000c79fa0, 0x3b9aca00, 0x0, 0x1, 0x0)
	/mnt/go/pkg/mod/k8s.io/apimachinery@v0.16.9-beta.0/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/mnt/go/pkg/mod/k8s.io/apimachinery@v0.16.9-beta.0/pkg/util/wait/wait.go:88
volcano.sh/volcano/pkg/controllers/job.(*Controller).Run.func1(0xc00045e000, 0x0, 0x1)
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller.go:240 +0x7b
created by volcano.sh/volcano/pkg/controllers/job.(*Controller).Run
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller.go:239 +0x341
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1178625]

goroutine 431 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/mnt/go/pkg/mod/k8s.io/apimachinery@v0.16.9-beta.0/pkg/util/runtime/runtime.go:55 +0x105
panic(0x13c30e0, 0x21ba5c0)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
volcano.sh/volcano/pkg/apis/batch/v1alpha1.(*Job).GetObjectKind(0x0, 0x17c7ac0, 0x0)
	<autogenerated>:1 +0x5
k8s.io/client-go/tools/reference.GetReference(0xc0002f5570, 0x176edc0, 0x0, 0x4dbf0b, 0x7f8eda6fc7e0, 0xc000c78918)
	/mnt/go/pkg/mod/k8s.io/client-go@v0.0.0-20191016111102-bec269661e48/tools/reference/ref.go:59 +0x116
k8s.io/client-go/tools/record.(*recorderImpl).generateEvent(0xc00017ff80, 0x176edc0, 0x0, 0x0, 0xbfb6dc9180116d97, 0x76e438a, 0x21d3d20, 0x1572ce5, 0x7, 0x157844f, ...)
	/mnt/go/pkg/mod/k8s.io/client-go@v0.0.0-20191016111102-bec269661e48/tools/record/event.go:291 +0x5d
k8s.io/client-go/tools/record.(*recorderImpl).Event(0xc00017ff80, 0x176edc0, 0x0, 0x1572ce5, 0x7, 0x157844f, 0xe, 0xc00019e1c0, 0xd2)
	/mnt/go/pkg/mod/k8s.io/client-go@v0.0.0-20191016111102-bec269661e48/tools/record/event.go:313 +0xc2
volcano.sh/volcano/pkg/controllers/job.(*Controller).initiateJob(0xc00045e000, 0x0, 0x0, 0x0, 0x0)
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller_actions.go:152 +0x464
volcano.sh/volcano/pkg/controllers/job.(*Controller).syncJob(0xc00045e000, 0xc0009ba060, 0xc00113e470, 0x0, 0x0)
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller_actions.go:195 +0x3d7
volcano.sh/volcano/pkg/controllers/job/state.(*pendingState).Execute(0xc000bcc000, 0x1572c21, 0x7, 0x2c, 0xc00048bd10)
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/state/pending.go:54 +0xbc
volcano.sh/volcano/pkg/controllers/job.(*Controller).processNextReq(0xc00045e000, 0xc000000001, 0x1582200)
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller.go:338 +0x8d8
volcano.sh/volcano/pkg/controllers/job.(*Controller).worker(0xc00045e000, 0xc000000001)
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller.go:260 +0xaa
volcano.sh/volcano/pkg/controllers/job.(*Controller).Run.func1.1()
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller.go:242 +0x31
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0003e07a0)
	/mnt/go/pkg/mod/k8s.io/apimachinery@v0.16.9-beta.0/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000c79fa0, 0x3b9aca00, 0x0, 0x1, 0x0)
	/mnt/go/pkg/mod/k8s.io/apimachinery@v0.16.9-beta.0/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/mnt/go/pkg/mod/k8s.io/apimachinery@v0.16.9-beta.0/pkg/util/wait/wait.go:88
volcano.sh/volcano/pkg/controllers/job.(*Controller).Run.func1(0xc00045e000, 0x0, 0x1)
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller.go:240 +0x7b
created by volcano.sh/volcano/pkg/controllers/job.(*Controller).Run
	/mnt/go/src/volcano.sh/volcano/pkg/controllers/job/job_controller.go:239 +0x341

@k82cn
Member

k82cn commented Jun 30, 2020

/cc @Thor-wl

@AHEADer

AHEADer commented Jul 1, 2020

I hit the same problem. Using the latest tag solved it.

@hzxuzhonghu
Collaborator

Thanks for reporting.

The root cause is this error: Failed to update status of Job default/lm-mpi-job: Job.batch.volcano.sh "lm-mpi-job" is invalid: status.state.lastTransitionTime: Invalid value: "null": status.state.lastTransitionTime in body must be of type string: "null"

That was fixed by #786.

The panic happens because job is nil when the error occurs, and the nil job is then passed to the event recorder:

	job, err := cc.initJobStatus(job)
	if err != nil {
		cc.recorder.Event(job, v1.EventTypeWarning, string(batch.JobStatusError),
			fmt.Sprintf("Failed to initialize job status, err: %v", err))
		return nil, err
	}
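A minimal sketch of one way to avoid this kind of panic, not necessarily what the actual fix does: keep the original job pointer and store the result of the status update in a separate variable, so the error path never hands nil to the recorder. The `Job` and `recorder` types and the always-failing `initJobStatus` are hypothetical stand-ins for the real controller code.

```go
package main

import (
	"errors"
	"fmt"
)

// Job is a hypothetical stand-in for the Volcano batch Job type.
type Job struct {
	Name string
}

// recorder is a hypothetical stand-in for an event recorder; like the real
// one, it dereferences the object it is given.
type recorder struct {
	events []string
}

func (r *recorder) Event(job *Job, reason, message string) {
	r.events = append(r.events, fmt.Sprintf("%s %s: %s", job.Name, reason, message))
}

// initJobStatus simulates a failed status update: it returns a nil job
// together with the error, mirroring the behaviour described above.
func initJobStatus(job *Job) (*Job, error) {
	return nil, errors.New("status.state.lastTransitionTime in body must be of type string")
}

// initiateJob stores the result in newJob instead of overwriting job, so the
// original, non-nil pointer is still available when the event is recorded.
func initiateJob(job *Job, rec *recorder) (*Job, error) {
	newJob, err := initJobStatus(job)
	if err != nil {
		rec.Event(job, "Warning",
			fmt.Sprintf("Failed to initialize job status, err: %v", err))
		return nil, err
	}
	return newJob, nil
}

func main() {
	rec := &recorder{}
	_, err := initiateJob(&Job{Name: "lm-mpi-job"}, rec)
	fmt.Println("error returned:", err != nil, "events recorded:", len(rec.events))
}
```

With the original pattern, `job, err := cc.initJobStatus(job)` reassigns `job`, so a nil return on error reaches `GetObjectKind` through the recorder, which matches the top frames of the stack trace.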

/assign @Thor-wl

@volcano-sh-bot
Contributor

@hzxuzhonghu: GitHub didn't allow me to assign the following users: Thor-wl.

Note that only volcano-sh members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


@k82cn
Member

k82cn commented Jul 3, 2020

It's fixed in the latest code on release-0.4; we're going to cut a 0.4.1 release that includes the fix.

@AHEADer

AHEADer commented Jul 3, 2020

Can you release a new image for this?

hzxuzhonghu added the priority/critical-urgent label Jul 3, 2020
@Thor-wl
Contributor

Thor-wl commented Jul 3, 2020

Fix PR: #901

@k82cn
Member

k82cn commented Jul 5, 2020

Can you release a new image for this?

Yes, we're going to have a new release (e.g. v0.4.1) this month.

@Thor-wl
Contributor

Thor-wl commented Jul 9, 2020

Can you release a new image for this?

Yes, we're going to have a new release (e.g. v0.4.1) this month.

The bug fix has already been merged into release-0.4.

@k82cn
Member

k82cn commented Jul 9, 2020

@Thor-wl, we should cut a v0.4.1 release that includes the fix.

@hzxuzhonghu
Collaborator

v0.4.1 has been released; please give it a try, @merryzhou.
