This repository has been archived by the owner on Dec 4, 2024. It is now read-only.

[BUG] - Operator Pod Error Loops if the Pod has to recreate #22

Open
odellem opened this issue Aug 1, 2024 · 0 comments
Labels
bug Something isn't working

Comments

odellem commented Aug 1, 2024

Describe the bug
After a payload is applied to the Slurm cluster, the operator creates the daemonset for the slurmabler pods. However, if the operator pod crashes or restarts, it enters an error loop because the daemonset already exists.

To Reproduce
Steps to reproduce the behavior:

  1. Install Slik
  2. Apply either payload
  3. Wait for the slurmabler pods to be created
  4. Delete the operator pod, allowing the deployment to recreate it, and check the logs for the error loop.

Expected behavior
The operator should handle the "already exists" error gracefully. Alternatively, if the daemonset does need to be created, the operator should delete the existing daemonset and then recreate it.
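
To illustrate the graceful option, here is a minimal sketch of an idempotent "create or update" step. It is not Slik's actual code: `FakeCluster`, `AlreadyExists`, and `ensure_daemonset` are hypothetical names, and the in-memory class only stands in for the Kubernetes API, which rejects a duplicate create with a 409 Conflict.

```python
from dataclasses import dataclass, field


class AlreadyExists(Exception):
    """Stand-in for the conflict error returned on a duplicate create."""


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for the cluster API (assumption)."""
    daemonsets: dict = field(default_factory=dict)

    def create(self, name, spec):
        # Mirrors the API behavior: creating an existing object fails.
        if name in self.daemonsets:
            raise AlreadyExists(name)
        self.daemonsets[name] = spec

    def replace(self, name, spec):
        self.daemonsets[name] = spec


def ensure_daemonset(cluster, name, spec):
    """Create the daemonset, or reconcile it if it is already there."""
    try:
        cluster.create(name, spec)
        return "created"
    except AlreadyExists:
        # A restarted operator lands here: the daemonset outlived the old pod.
        if cluster.daemonsets[name] == spec:
            return "unchanged"
        cluster.replace(name, spec)
        return "updated"
```

With this shape, a second call after an operator restart returns `"unchanged"` or `"updated"` instead of raising, so the reconcile loop can continue. Deleting and recreating the daemonset, as suggested above, would also work but restarts every slurmabler pod.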

Additional context
Deleting the daemonset and restarting the operator pod fixes the problem. However, pods are moved around during the rolling update when a cluster is upgraded, so any cluster upgrade will break the Slurm operator.

odellem added the bug label Aug 1, 2024