Bug: Using active deadline does not wait for container to terminate before finishing #64

Closed
irvinlim opened this issue Apr 18, 2022 · 0 comments · Fixed by #85
Labels

  • area/workloads: Related to workload execution (e.g. jobs, tasks)
  • component/execution: Issues or PRs related exclusively to the Execution component (Job, JobConfig)
  • kind/bug: Categorizes issue or PR as related to a bug.
Milestone

Comments

irvinlim commented Apr 18, 2022

Currently, we use active deadlines to kill Pods, which apparently does not ensure that the container has already terminated before the Job is considered finished.
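For context, "active deadline" refers to the Pod-level activeDeadlineSeconds field: once the deadline elapses, the kubelet starts killing the Pod and marks its phase as Failed with reason DeadlineExceeded, even though the containers may still be inside their graceful termination period. A minimal sketch of a Pod using this field (the Pod name is a hypothetical placeholder):

```yaml
# Sketch: Pod with an active deadline. After 5 seconds the kubelet begins
# terminating the Pod and sets status.phase to Failed (reason: DeadlineExceeded),
# possibly before the container has finished handling its termination signal.
apiVersion: v1
kind: Pod
metadata:
  name: active-deadline-demo  # hypothetical name
spec:
  activeDeadlineSeconds: 5
  restartPolicy: Never
  containers:
    - name: job-container
      image: irvinlim/signalbin
      args:
        - SIGINT,SIGTERM
        - -sq
        - -t=120s
```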

The following JobConfig allows us to replicate this issue. We use https://github.com/irvinlim/signalbin to test interactions with signal handlers.

```yaml
apiVersion: execution.furiko.io/v1alpha1
kind: JobConfig
metadata:
  name: jobconfig-signal-demo
spec:
  concurrency:
    policy: Allow
  template:
    spec:
      task:
        template:
          spec:
            containers:
              - name: job-container
                image: irvinlim/signalbin
                args:
                  - SIGINT,SIGTERM
                  - -sq
                  - -t=120s
```

When killing the Job with killTimestamp, we see that the Job reaches the Killed phase even while the container is still running.
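For reference, killing the Job is done by setting spec.killTimestamp on the Job object. A sketch of the patch body (the field name comes from the issue text; the exact schema and timestamp value here are assumptions for illustration):

```yaml
# Sketch: patch body to kill a running Furiko Job by setting killTimestamp
# to the current time (RFC 3339); the controller then terminates the task.
spec:
  killTimestamp: "2022-04-18T19:45:00Z"  # illustrative timestamp
```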

[Screenshot (2022-04-19): the Job is shown in the Killed phase while the container is still running]

Once the container is complete, we can see that the logs stop (meaning that the container exited), and the Pod's containerStatuses move from running to terminated.

[Screenshot (2022-04-19): after the container exits, the Pod's containerStatuses show terminated]

The implications of this include:

  • Incorrect handling of graceful termination (the Job appears to have terminated immediately, rather than shutting down gracefully)
  • Concurrency policy may be violated (the container is still pending termination, but another job has already started)

Possible solutions:

  1. Easy fix: Do not depend on the status.phase of the Pod to determine the task state. In this case, we need to look at the containerStatuses AND the phase to determine if all containers are dead AND they will not be recreated.
  2. Abandon the active deadline approach: There are other problems with using active deadline and force deletion at the same time. Alternatively, we could keep the active deadline behavior behind a config/feature flag.
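The gist of fix (1) can be sketched as follows. This is a simplified illustration, not the actual controller code: the struct types below are hypothetical mirrors of the relevant Kubernetes PodStatus fields (the real implementation would use k8s.io/api/core/v1 types), and the function name is made up.

```go
package main

import "fmt"

// ContainerState is a simplified stand-in for a container's status:
// true once the container has reached a terminated state.
type ContainerState struct {
	Terminated bool
}

// PodStatus mirrors the two fields fix (1) says we must consult together.
type PodStatus struct {
	Phase             string // e.g. "Running", "Succeeded", "Failed"
	ContainerStatuses []ContainerState
}

// allContainersTerminated only considers the task dead once the Pod phase
// is terminal AND every container has terminated, so a Pod that was marked
// Failed by the active deadline while its container is still running is
// not treated as finished.
func allContainersTerminated(status PodStatus) bool {
	if status.Phase != "Succeeded" && status.Phase != "Failed" {
		return false
	}
	for _, cs := range status.ContainerStatuses {
		if !cs.Terminated {
			return false
		}
	}
	return true
}

func main() {
	// Pod marked Failed by the active deadline, but container still running:
	// must not be considered finished yet.
	fmt.Println(allContainersTerminated(PodStatus{
		Phase:             "Failed",
		ContainerStatuses: []ContainerState{{Terminated: false}},
	})) // false

	// Container has actually exited: now the task is finished.
	fmt.Println(allContainersTerminated(PodStatus{
		Phase:             "Failed",
		ContainerStatuses: []ContainerState{{Terminated: true}},
	})) // true
}
```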
@irvinlim irvinlim added kind/bug Categorizes issue or PR as related to a bug. component/execution Issues or PRs related exclusively to the Execution component (Job, JobConfig) area/workloads Related to workload execution (e.g. jobs, tasks) labels Apr 18, 2022
@irvinlim irvinlim added this to the v0.2.0 milestone Jun 5, 2022