
Autofailover Stuck in Unable to Find Pod Status After graphd scale down #529

Closed
kevinliu24 opened this issue Oct 7, 2024 · 1 comment · Fixed by #530
Assignees: kevinliu24
Labels: affects/v1.8, process/fixed, severity/minor, type/bug

Comments

kevinliu24 (Contributor) commented Oct 7, 2024

Please check the FAQ documentation before raising an issue

Describe the bug (required)

Snap reported that graphd failed to scale up and start new pods after the nebula autoscaler increased the number of graphd replicas from 2 to 4. The desired replica count shown by kubectl describe was correct for both the autoscaler and the nebula cluster, but no new pods were started. Further investigation revealed the error E1007 18:17:25.249973 1 nebula_cluster_controller.go:196] NebulaCluster [cb/cb] reconcile failed: rebuilt graphd pod [cb/cb-graphd-2] not found, skip in the operator log, which was thrown during auto failover while checking the status of new pods. kubectl get pods also reveals only 2 graphd pods. This happened due to the following sequence:

  1. Auto failover was triggered for a graphd pod due to a failure such as a node going down.
  2. A new pod was started but remained in Pending state.
  3. Before the pod reached Running state, the nebula autoscaler was triggered to scale down graphd, causing the new pod to be terminated.
  4. However, the new pod was never removed from the auto failover map, because it never reached Running state (currently auto failover only removes a pod from its map once the pod is Running).
  5. As a result, auto failover gets stuck looking for a pod that doesn't exist, and the new graphd pods fail to start because scaling currently happens only after auto failover succeeds.

Solution: Remove the pod from the auto failover map when it's terminated.
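
For anyone hitting the same issue, the stuck state described above can be confirmed roughly as follows. This is only a sketch: the cb namespace and cluster name come from the attached logs, while the operator namespace, controller-manager pod name, and status field layout are assumptions that may differ in your deployment.

  # Only 2 graphd pods exist even though 4 replicas are desired
  kubectl -n cb get pods | grep graphd

  # The operator keeps logging the failover error on every reconcile
  kubectl -n <operator-namespace> logs <controller-manager-pod> | grep "not found, skip"

  # Check whether the terminated pod is still tracked in the cluster's failover bookkeeping
  # (the exact status field name depends on the operator version)
  kubectl -n cb get nc cb -o yaml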

Related logs are attached below.
Snap-na-describe-output.txt
cb_nc.txt
controller-manager-logs.txt
Snap-nc-pods-output.txt

Your Environments (required)

  • Any Kubernetes cluster with local-pv and nebula-scheduler

How To Reproduce (required)

Steps to reproduce the behavior (a rough kubectl sketch follows the list):

  1. Start a Kubernetes cluster and deploy NebulaGraph with multiple graphd pods. Make sure auto failover and local PV are turned on.
  2. Deploy the nebula autoscaler with maximum replicas >= the current graphd replicas.
  3. Cordon a node running a graphd pod and wait for the affected graphd pod to go into Pending status.
  4. Modify the nebula autoscaler and set maximum replicas to < the current graphd replicas.
  5. Wait for the new pod to be terminated.
  6. Modify the nebula autoscaler again and set maximum replicas to > the current graphd replicas.
  7. New pods will fail to start and the cluster will be stuck in the auto failover state, with the error from the description appearing in the operator log.
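
For reference, steps 3–6 correspond roughly to the commands below. This is a sketch only: the nebulaautoscaler resource name, its maximum-replicas field, and the placeholder names are assumptions, so adjust them to your setup.

  # Step 3: cordon a node that runs a graphd pod, then wait for the affected pod to sit in Pending
  kubectl cordon <node-running-a-graphd-pod>
  kubectl -n <namespace> get pods -w

  # Steps 4-5: lower the autoscaler's maximum replicas below the current graphd replica count,
  # then wait for the pending replacement pod to be terminated
  kubectl -n <namespace> edit nebulaautoscaler <autoscaler-name>

  # Step 6: raise the maximum replicas above the current graphd replica count again
  kubectl -n <namespace> edit nebulaautoscaler <autoscaler-name>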

Expected behavior

Graphd should scale up and start new pods successfully.

Additional context

All related logs and the cluster config are attached.

@kevinliu24 kevinliu24 added the type/bug Type: something is unexpected label Oct 7, 2024
@kevinliu24 kevinliu24 self-assigned this Oct 7, 2024
@github-actions github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Oct 7, 2024
kevinliu24 (Contributor, Author) commented Oct 7, 2024

For now, Snap has worked around the issue with the following steps, but this still needs a proper fix (a rough kubectl sketch follows the list):

  1. Run kubectl edit nc and set Enable Auto Failover to false. This allows the operator to get out of the loop and do the scaling.
  2. Wait for the new pods to start up.
  3. Then run kubectl edit nc again and set Enable Auto Failover back to true. Auto failover should automatically clear the failed pod since it is now in Running state.
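
In kubectl terms, the workaround looks roughly like the sketch below. The exact spec key for the auto failover switch depends on the NebulaCluster CRD version, so verify it with kubectl explain or the CRD before editing.

  # Step 1: disable auto failover so the operator can exit the failover loop and perform the scaling
  kubectl -n <namespace> edit nc <cluster-name>   # set the auto failover field in spec to false

  # Step 2: watch the new graphd pods come up
  kubectl -n <namespace> get pods -w

  # Step 3: re-enable auto failover; the failed pod entry should clear once the pod is Running
  kubectl -n <namespace> edit nc <cluster-name>   # set the auto failover field back to true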

@kevinliu24 kevinliu24 added affects/v1.8 PR/Issue: this bug affects v1.8.x version. severity/minor Severity of bug and removed affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Oct 7, 2024
@kevinliu24 kevinliu24 linked a pull request Oct 21, 2024 that will close this issue
@github-actions github-actions bot added the process/fixed Process of bug label Oct 21, 2024