
Deleting a succeeded pod causes Volcano to recreate a new pod and repeat the task #791

Closed
gaopeiliang opened this issue May 7, 2020 · 17 comments
Labels
area/controllers kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

@gaopeiliang

What happened:
When a job is running and one pod's task completes successfully, if that pod is then deleted for some reason, another new pod is created that repeats the same task.

What you expected to happen:
The delete event for a successfully completed pod should be ignored, and no new identical pod should be created.

How to reproduce it (as minimally and precisely as possible):

  1. Run a job with one short-running pod that completes successfully and one long-running pod.
  2. Delete the short-running pod after it has completed.

A new short-running pod is created to do the same work.

Anything else we need to know?:

Environment:

  • Volcano Version:
    release-0.4

  • Kubernetes version (use kubectl version):
    1.14

  • OS (e.g. from /etc/os-release):
    ubuntu:16

  • Kernel (e.g. uname -a):
    Linux 4.9.70-040970-generic

  • Install tools:

  • Others:

@gaopeiliang (Author) commented May 7, 2020

```go
func (cc *Controller) deletePod(obj interface{}) {
	// The pod delete event is received here and the pod is removed from
	// the cache, even when it completed successfully and was only
	// deleted for some external reason.
	if err := cc.cache.DeletePod(pod); err != nil {
		klog.Errorf("Failed to delete Pod <%s/%s>: %v in cache",
			pod.Namespace, pod.Name, err)
	}

	key := jobhelpers.GetJobKeyByReq(&req)
	queue := cc.getWorkerQueue(key)
	queue.Add(req)
}

func (cc *Controller) processNextReq(count uint32) bool {
	// clones the JobInfo; for a Running job this dispatches to syncJob
}

func (cc *Controller) syncJob(jobInfo *apis.JobInfo, updateStatus state.UpdateStatusFn) error {
	for i := 0; i < int(ts.Replicas); i++ {
		podName := fmt.Sprintf(jobhelpers.PodNameFmt, job.Name, name, i)
		if pod, found := pods[podName]; !found {
			// HERE a new pod is created, because the old one is no longer
			// found in the cache, even though it completed successfully.
			newPod := createJobPod(job, tc, i)

			if err := cc.pluginOnPodCreate(job, newPod); err != nil {
				return err
			}
			podToCreate = append(podToCreate, newPod)
		} else {
			delete(pods, podName)
			if pod.DeletionTimestamp != nil {
				klog.Infof("Pod <%s/%s> is terminating", pod.Namespace, pod.Name)
				atomic.AddInt32(&terminating, 1)
				continue
			}

			classifyAndAddUpPodBaseOnPhase(pod, &pending, &running, &succeeded, &failed, &unknown)
		}
	}
}
```

@hzxuzhonghu (Collaborator)

@gaopeiliang Yeah, that's the current behavior. We could cache the completed pods and then ignore them during reconcile in the sync handler.

Would appreciate it if you could help.
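If it helps, here's a minimal standalone sketch of that approach, assuming a toy cache shape. The `jobCache` type, its `succeeded` map, and `shouldRecreate` are illustrative names only, not the actual Volcano jobcache API:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Sketch only: a cache that remembers succeeded pods instead of forgetting
// them on delete, so the sync handler can skip recreating them.
type jobCache struct {
	pods      map[string]*v1.Pod // live pods, keyed by pod name
	succeeded map[string]bool    // names of pods that completed before deletion
}

func (jc *jobCache) DeletePod(pod *v1.Pod) {
	if pod.Status.Phase == v1.PodSucceeded {
		// Keep a tombstone so reconcile still sees the finished task.
		jc.succeeded[pod.Name] = true
	}
	delete(jc.pods, pod.Name)
}

// shouldRecreate mirrors the !found branch in syncJob: a missing pod is
// recreated only if it never succeeded.
func (jc *jobCache) shouldRecreate(podName string) bool {
	_, live := jc.pods[podName]
	return !live && !jc.succeeded[podName]
}

func main() {
	jc := &jobCache{pods: map[string]*v1.Pod{}, succeeded: map[string]bool{}}
	pod := &v1.Pod{}
	pod.Name = "test-job-short-0"
	pod.Status.Phase = v1.PodSucceeded
	jc.pods[pod.Name] = pod

	jc.DeletePod(pod)
	fmt.Println("recreate?", jc.shouldRecreate(pod.Name)) // recreate? false
}
```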

@hzxuzhonghu hzxuzhonghu added area/controllers kind/bug Categorizes issue or PR as related to a bug. labels May 8, 2020
@hzxuzhonghu hzxuzhonghu added this to the v1.0 milestone May 8, 2020
@k82cn (Member) commented May 8, 2020

Good catch! We did not check the task's status in this case :)

@vincent-pli

So, if we check the status of the pod and don't launch the request when status.phase is Succeeded, could that fix the issue?
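Something like this predicate, as a standalone sketch (`shouldEnqueueDelete` is an illustrative name, not existing Volcano code):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Sketch only: filter pod-deletion events before enqueueing a sync request.
func shouldEnqueueDelete(pod *v1.Pod) bool {
	// A succeeded pod's task is done; its deletion should not re-run it.
	return pod.Status.Phase != v1.PodSucceeded
}

func main() {
	done := &v1.Pod{Status: v1.PodStatus{Phase: v1.PodSucceeded}}
	running := &v1.Pod{Status: v1.PodStatus{Phase: v1.PodRunning}}
	fmt.Println(shouldEnqueueDelete(done))    // false: ignore the event
	fmt.Println(shouldEnqueueDelete(running)) // true: enqueue a sync
}
```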

@k82cn (Member) commented May 11, 2020

Hmm... I think we should handle this in the syncJob func :)

@vincent-pli

@k82cn
I think handling it in syncJob may not fix the issue. See this sample:

```yaml
policies:
  - event: PodEvicted
    action: RestartJob
```

The PodEvicted event triggers action: RestartJob, so control never reaches syncJob. And a "delete after success" is not actually an eviction.

@hzxuzhonghu (Collaborator)

In this case, we should ignore the delete event.

@k82cn (Member) commented May 12, 2020

> The PodEvicted event triggers action: RestartJob, so control never reaches syncJob. And a "delete after success" is not actually an eviction.

👍 Yes; so maybe we only record an event and ignore this action :)

@vincent-pli

@k82cn @hzxuzhonghu
Thanks for clarifying. I'm new to this project, so maybe I'm missing something. I wonder why we insist on launching a request for a PodEvicted event in a scenario that isn't actually an eviction. Does the no-launch case cause an issue somewhere? Thanks.

@k82cn (Member) commented May 12, 2020

Oh, other options are OK to me :) Please help open a PR for that; we can discuss the details there.

@gaopeiliang (Author)

What happens if we ignore all pod delete events?

```go
func (cc *Controller) deletePod(obj interface{}) {
	// .........

	//if err := cc.cache.DeletePod(pod); err != nil {
	//	klog.Errorf("Failed to delete Pod <%s/%s>: %v in cache",
	//		pod.Namespace, pod.Name, err)
	//}

	// ......
}
```

Maybe that would work...

@vincent-pli

@gaopeiliang
That stops the new pod from being created, but it introduces a new issue: if you delete the related job and then create it again, the previously deleted task (pod) is never created again. I guess that's caused by skipping the cache deletion...

Even when we avoid emitting the PodEvicted event, the issue is not fixed, since we get another request:

`<Queue: , Job: default/test-job, Task:, Event:Unknown, ExitCode:0, Action:, JobVersion: 0>`

which triggers syncJob. That request is emitted by the PodGroup update event handler, because something updates the PodGroup when the succeeded pod is deleted. I'm still checking who does that... maybe the scheduler? Need help.

@gaopeiliang (Author) commented May 13, 2020

What do you mean by "delete the related job then create it again"? Create a new job? Deleting the job cleans up everything about it in the cache...

@vincent-pli

@gaopeiliang
Yes, delete and then create the same job. But based on my test, the previously deleted task is gone and never comes back.

vincent-pli added a commit to vincent-pli/volcano that referenced this issue May 13, 2020
Try to address issue volcano-sh#791. It's a draft solution; needs further discussion.

Signed-off-by: pengli <justdoit.pli@gmail.com>
@gaopeiliang (Author)

```go
func (cc *Controller) deletePod(obj interface{}) {
	// ......

	if pod.Status.Phase != v1.PodSucceeded {
		if err := cc.cache.DeletePod(pod); err != nil {
			klog.Errorf("Failed to delete Pod <%s/%s>: %v in cache",
				pod.Namespace, pod.Name, err)
		}
	}

	// ......
}
```

even though it makes the code less clean
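If I'm reading syncJob right, keeping the succeeded pod in the cache means it now takes the found branch, where classifyAndAddUpPodBaseOnPhase counts it as succeeded instead of a replacement being appended to podToCreate; whether the DeletionTimestamp check just above that needs a carve-out for this case is probably worth verifying too.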

@Thor-wl Thor-wl mentioned this issue May 29, 2020
stale bot commented Aug 18, 2020

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for 60 days, this issue will be closed (we can always reopen an issue if we need to!).

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2020
stale bot commented Oct 17, 2020

Closing for now, as there was no activity for the last 60 days after being marked as stale. Let us know if you need this reopened! 🤗

@stale stale bot closed this as completed Oct 17, 2020