Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(controller): Workflow hangs indefinitely during ContainerCreating if the Pod or Node unexpectedly dies #5585

Merged
merged 8 commits into from
Apr 7, 2021

Conversation

sarabala1979
Copy link
Member

Signed-off-by: Saravanan Balasubramanian sarabala1979@gmail.com

Checklist:

… if the Pod or Node unexpectedly dies

Signed-off-by: Saravanan Balasubramanian <sarabala1979@gmail.com>
@codecov
Copy link

codecov bot commented Apr 2, 2021

Codecov Report

Merging #5585 (a243bf5) into master (4e450e2) will increase coverage by 0.02%.
The diff coverage is 53.57%.

❗ Current head a243bf5 differs from pull request most recent head 2a7ea4e. Consider uploading reports for the commit 2a7ea4e to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5585      +/-   ##
==========================================
+ Coverage   47.01%   47.04%   +0.02%     
==========================================
  Files         240      240              
  Lines       15002    15021      +19     
==========================================
+ Hits         7053     7066      +13     
- Misses       7048     7059      +11     
+ Partials      901      896       -5     
Impacted Files Coverage Δ
workflow/artifacts/oss/oss.go 7.89% <0.00%> (-0.33%) ⬇️
workflow/executor/resource.go 22.94% <0.00%> (-0.70%) ⬇️
workflow/artifacts/git/git.go 40.21% <85.71%> (+3.43%) ⬆️
pkg/apis/workflow/v1alpha1/workflow_types.go 46.04% <100.00%> (+0.25%) ⬆️
workflow/controller/operator.go 71.17% <100.00%> (+0.48%) ⬆️
server/workflow/workflow_server.go 40.45% <0.00%> (-2.28%) ⬇️
cmd/argo/commands/get.go 56.66% <0.00%> (+0.66%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4e450e2...2a7ea4e. Read the comment docs.

@sarabala1979
Copy link
Member Author

Pod will be in the pending status if any of the container is not in containerReady state. Code is setting pending status if container in waiting state.

for _, c := range pod.Status.ContainerStatuses {
ctrNodeName := fmt.Sprintf("%s.%s", node.Name, c.Name)
if woc.wf.GetNodeByName(ctrNodeName) == nil {
continue
}
switch {
case c.State.Waiting != nil:
woc.markNodePhase(ctrNodeName, wfv1.NodePending)
case c.State.Running != nil:
woc.markNodePhase(ctrNodeName, wfv1.NodeRunning)
case c.State.Terminated != nil:
exitCode := int(c.State.Terminated.ExitCode)
message := fmt.Sprintf("%s (exit code %d): %s", c.State.Terminated.Reason, exitCode, c.State.Terminated.Message)
switch exitCode {
case 0:
woc.markNodePhase(ctrNodeName, wfv1.NodeSucceeded)

Cont>

Name:           my-nginx-9d5677d94-g44l6 Namespace:      default Node: kubenode1/10.1.88.22 Start Time:     Tue, 06 Mar 2018 08:24:13

+0000 Labels:
pod-template-hash=581233850
run=my-nginx Annotations:
Status: Pending
IP: Controlled By: ReplicaSet/my-nginx-9d5677d94 Containers:
my-nginx:
Container ID:
Image: nginx
Image ID:
Port: 80/TCP
> State: Waiting
Reason: ContainerCreating

Ready: False
Restart Count: 0
Environment: roller is checking only pod status in code.

@sarabala1979
Copy link
Member Author

@alexec Do you think we can move nodestatus to nodeRunning when the container state is is in waiting? Currently code is keeping nodepending.
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle


Container states
As well as the phase of the Pod overall, Kubernetes tracks the state of each container inside a Pod. You can use container lifecycle hooks to trigger events to run at certain points in a container's lifecycle.

Once the scheduler assigns a Pod to a Node, the kubelet starts creating containers for that Pod using a container runtime. There are three possible container states: Waiting, Running, and Terminated.

To check the state of a Pod's containers, you can use kubectl describe pod <name-of-pod>. The output shows the state for each container within that Pod.

Each state has a specific meaning:

Waiting
If a container is not in either the Running or Terminated state, it is Waiting. A container in the Waiting state is still running the operations it requires in order to complete start up: for example, pulling the container image from a container image registry, or applying Secret data. When you use kubectl to query a Pod with a container that is Waiting, you also see a Reason field to summarize why the container is in that state.

Signed-off-by: Saravanan Balasubramanian <sarabala1979@gmail.com>
Signed-off-by: Saravanan Balasubramanian <sarabala1979@gmail.com>
Signed-off-by: Saravanan Balasubramanian <sarabala1979@gmail.com>
Signed-off-by: Saravanan Balasubramanian <sarabala1979@gmail.com>
@@ -975,7 +975,8 @@ func (woc *wfOperationCtx) podReconciliation(ctx context.Context) error {

// If the node is pending and the pod does not exist, it could be the case that we want to try to submit it
// again instead of marking it as an error. Check if that's the case.
if node.Pending() {
// Node will be in pending state without Pod create if Node is waiting for Synchronize lock
if node.Pending() && !node.IsPodCreated() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one valid scenario where node status will be pending without pod creation. It is the template that is waiting for sync lock to acquire.

@alexec
Copy link
Contributor

alexec commented Apr 5, 2021

pod.status.phase is not a good way to understand the state of the pod.

So what is better?

@sarabala1979 I'm not sure what to do right now, I did wonder if gitops-engine is correct:

https://github.com/argoproj/gitops-engine/blob/master/pkg/health/health_pod.go#L14

But no, not correct.

What about kubect get pod - how does that determine STATUS?

 k get pod
NAME                     READY   STATUS    RESTARTS   AGE
minio-657cbb9d6d-2225w   1/1     Running   0          30h
nats-0                   1/1     Running   0          19h
stan-0                   1/1     Running   0          19h

I know that we can be running, but status is things like

  • CrashloopBackoff
  • NotReady
  • ErrImagePullBackOff

Do you know where to find the source code?

@alexec
Copy link
Contributor

alexec commented Apr 5, 2021

To make this worse - a "waiting" container can be in error state and not recoverable!

Signed-off-by: Saravanan Balasubramanian <sarabala1979@gmail.com>
@@ -976,9 +976,10 @@ func (woc *wfOperationCtx) podReconciliation(ctx context.Context) error {
// If the node is pending and the pod does not exist, it could be the case that we want to try to submit it
// again instead of marking it as an error. Check if that's the case.
// Node will be in pending state without Pod create if Node is waiting for Synchronize lock
if node.Pending() && !node.IsPodCreated() {
if node.Pending() && node.GetPendingReason() == wfv1.WaitingForSyncLock {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be || not &&?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be &&

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The node should not fail/error if the node is in a pending state with WaitforSyncLock.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure? && means that any node that is pending and not waiting for lock may be marked as error

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. I am sure. There is another scenario (pod. is not updated to informer) which will cover in recentlyStarted

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simster7 - can I request another opinion on this? I feel like I'm going mad.

@sarabala1979 sarabala1979 marked this pull request as ready for review April 6, 2021 16:23
pkg/apis/workflow/v1alpha1/workflow_types.go Outdated Show resolved Hide resolved
pkg/apis/workflow/v1alpha1/workflow_types.go Outdated Show resolved Hide resolved
pkg/apis/workflow/v1alpha1/workflow_types.go Show resolved Hide resolved
@@ -976,9 +976,10 @@ func (woc *wfOperationCtx) podReconciliation(ctx context.Context) error {
// If the node is pending and the pod does not exist, it could be the case that we want to try to submit it
// again instead of marking it as an error. Check if that's the case.
// Node will be in pending state without Pod create if Node is waiting for Synchronize lock
if node.Pending() && !node.IsPodCreated() {
if node.Pending() && node.GetPendingReason() == wfv1.WaitingForSyncLock {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure? && means that any node that is pending and not waiting for lock may be marked as error

@alexec alexec self-assigned this Apr 6, 2021
Signed-off-by: Saravanan Balasubramanian <sarabala1979@gmail.com>
@sarabala1979 sarabala1979 merged commit 0edd32b into argoproj:master Apr 7, 2021
@simster7 simster7 mentioned this pull request Apr 19, 2021
50 tasks
simster7 pushed a commit that referenced this pull request Apr 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Workflow hangs indefinitely during "ContainerCreating" if the Pod or Node unexpectedly dies
2 participants