
Autofailover Stuck in Unable to Find Pod Status After graphd scale down #529

Closed
kevinliu24 opened this issue Oct 7, 2024 · 1 comment · Fixed by #530
Assignees: kevinliu24
Labels: affects/v1.8, process/fixed, severity/minor, type/bug

Comments

kevinliu24 (Contributor) commented Oct 7, 2024

Please check the FAQ documentation before raising an issue

Describe the bug (required)

Snap reported that graphd failed to scale up and start new pods after the nebula autoscaler increased the number of graphd replicas from 2 to 4. The desired replica count shown by kubectl describe was correct for both the autoscaler and the nebula cluster, but no new pods were started. Further investigation revealed the error E1007 18:17:25.249973 1 nebula_cluster_controller.go:196] NebulaCluster [cb/cb] reconcile failed: rebuilt graphd pod [cb/cb-graphd-2] not found, skip in the operator log, which was thrown during auto failover while checking the status of new pods. kubectl get pods also reveals only 2 graphd pods. This happened due to the following sequence:

  1. Auto failover was triggered for a graphd pod due to a failure such as a node going down.
  2. A new pod was started but remained in Pending state.
  3. Before the pod reached Running state, the nebula autoscaler was triggered to scale down graphd, causing the new pod to be terminated.
  4. However, the new pod was never removed from the auto failover map, because it never reached Running state (currently auto failover only removes a pod from its map once the pod is Running).
  5. As a result, auto failover gets stuck looking for a pod that doesn't exist, and the new graphd pods fail to start because scaling currently happens only after auto failover succeeds.

Solution: Remove the pod from the auto failover map when it's terminated.
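
For anyone hitting the same issue, the stuck state described above can be confirmed roughly as follows. This is only a sketch: the cb namespace and cluster name come from the attached logs, while the operator namespace, controller-manager pod name, and status field layout are assumptions that may differ in your deployment.

  # Only 2 graphd pods exist even though 4 replicas are desired
  kubectl -n cb get pods | grep graphd

  # The operator keeps logging the failover error on every reconcile
  kubectl -n <operator-namespace> logs <controller-manager-pod> | grep "not found, skip"

  # Check whether the terminated pod is still tracked in the cluster's failover bookkeeping
  # (the exact status field name depends on the operator version)
  kubectl -n cb get nc cb -o yaml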

Related logs are attached below.
Snap-na-describe-output.txt
cb_nc.txt
controller-manager-logs.txt
Snap-nc-pods-output.txt

Your Environments (required)

  • Any Kubernetes cluster with local-pv and nebula-scheduler

How To Reproduce (required)

Steps to reproduce the behavior (a rough kubectl sketch follows the list):

  1. Start a Kubernetes cluster and deploy NebulaGraph with multiple graphd pods. Make sure auto failover and local PV are turned on.
  2. Deploy the nebula autoscaler with maximum replicas >= the current graphd replicas.
  3. Cordon a node running a graphd pod and wait for the affected graphd pod to go into Pending status.
  4. Modify the nebula autoscaler and set maximum replicas to < the current graphd replicas.
  5. Wait for the new pod to be terminated.
  6. Modify the nebula autoscaler again and set maximum replicas to > the current graphd replicas.
  7. New pods will fail to start and the cluster will be stuck in the auto failover state, with the error from the description appearing in the operator log.
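
For reference, steps 3–6 correspond roughly to the commands below. This is a sketch only: the nebulaautoscaler resource name, its maximum-replicas field, and the placeholder names are assumptions, so adjust them to your setup.

  # Step 3: cordon a node that runs a graphd pod, then wait for the affected pod to sit in Pending
  kubectl cordon <node-running-a-graphd-pod>
  kubectl -n <namespace> get pods -w

  # Steps 4-5: lower the autoscaler's maximum replicas below the current graphd replica count,
  # then wait for the pending replacement pod to be terminated
  kubectl -n <namespace> edit nebulaautoscaler <autoscaler-name>

  # Step 6: raise the maximum replicas above the current graphd replica count again
  kubectl -n <namespace> edit nebulaautoscaler <autoscaler-name>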

Expected behavior

Graphd should scale up and start new pods successfully.

Additional context

All related logs and the cluster config are attached.

@kevinliu24 kevinliu24 added the type/bug Type: something is unexpected label Oct 7, 2024
@kevinliu24 kevinliu24 self-assigned this Oct 7, 2024
@github-actions github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Oct 7, 2024
kevinliu24 (Contributor, Author) commented Oct 7, 2024

For now, Snap has worked around the issue with the following steps, but this still needs a proper fix (a rough kubectl sketch follows the list):

  1. Run kubectl edit nc and set Enable Auto Failover to false. This allows the operator to get out of the loop and do the scaling.
  2. Wait for the new pods to start up.
  3. Then run kubectl edit nc again and set Enable Auto Failover back to true. Auto failover should automatically clear the failed pod since it is now in Running state.
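
In kubectl terms, the workaround looks roughly like the sketch below. The exact spec key for the auto failover switch depends on the NebulaCluster CRD version, so verify it with kubectl explain or the CRD before editing.

  # Step 1: disable auto failover so the operator can exit the failover loop and perform the scaling
  kubectl -n <namespace> edit nc <cluster-name>   # set the auto failover field in spec to false

  # Step 2: watch the new graphd pods come up
  kubectl -n <namespace> get pods -w

  # Step 3: re-enable auto failover; the failed pod entry should clear once the pod is Running
  kubectl -n <namespace> edit nc <cluster-name>   # set the auto failover field back to true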

@kevinliu24 kevinliu24 added affects/v1.8 PR/Issue: this bug affects v1.8.x version. severity/minor Severity of bug and removed affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Oct 7, 2024
@kevinliu24 kevinliu24 linked a pull request Oct 21, 2024 that will close this issue
@github-actions github-actions bot added the process/fixed Process of bug label Oct 21, 2024