control plane node (or etcd) outage causes akash provider to close all the leases #245

Open
andy108369 opened this issue Aug 9, 2024 · 0 comments

andy108369 commented Aug 9, 2024

Two providers reported that akash-provider closed ALL of their leases while the control plane node was offline.

  • 7 Aug 2024 - DCNorse provider - control plane node crashed (was down for some time).
  • 9 Aug 2024 - Europlots provider - control plane node was unavailable for 15 minutes during provider migration.

Both clusters have only 1 etcd member and 1 control plane node in total (etcd runs on the control plane node).

My guess is that one of the following is happening:
A. akash-provider can't reach K8s to query the deployments in Running state, which triggers the monitorMaxRetries counter and closes the leases after 8 minutes.

B. akash-provider can't reach etcd (since it also runs on the control plane node) to query the manifests CRD and decides to remove them.
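
To make hypothesis A concrete, here is a minimal sketch of a retry counter that treats "the API did not answer" differently from "the API answered and the deployment is unhealthy". This is not the provider's actual code: `leaseMonitor`, `observe` and `errUnreachable` are hypothetical names made up for illustration, and the real monitor may be structured quite differently.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable is a hypothetical sentinel meaning "the Kubernetes API did not answer".
var errUnreachable = errors.New("kubernetes API unreachable")

// leaseMonitor is a hypothetical stand-in for a per-lease health monitor.
type leaseMonitor struct {
	maxRetries int // confirmed unhealthy checks tolerated before closing
	failures   int // consecutive confirmed unhealthy checks so far
}

// observe records one health check and reports whether the lease should be closed.
// A check that returned an error carries no information about the deployment,
// so it leaves the failure counter unchanged instead of advancing it.
func (m *leaseMonitor) observe(healthy bool, err error) bool {
	if err != nil {
		return false // state unknown: never close on missing information
	}
	if healthy {
		m.failures = 0
		return false
	}
	m.failures++
	return m.failures >= m.maxRetries
}

func main() {
	m := &leaseMonitor{maxRetries: 3}

	// Simulate a control plane outage: every check fails to reach the API.
	for i := 0; i < 10; i++ {
		if m.observe(false, errUnreachable) {
			fmt.Println("lease closed during outage")
		}
	}
	fmt.Println("failures counted during outage:", m.failures)
}
```

If the counter advanced on every failed check regardless of the cause, a control plane outage longer than the retry window would close every lease at once, which matches what both providers observed.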

When akash-provider does not have enough information (from control plane / etcd) about the deployments (either Pods or Deployment / StatefulSet resources, and their state), it should not make critical decisions such as terminating all the leases. Ideally, it should first verify that K8s/etcd is healthy.
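
A rough sketch of such a health gate, assuming the provider can reach the API server's `/readyz` endpoint (a standard kube-apiserver health endpoint). Everything else is hypothetical: `readyz`, `mayCloseLeases`, `errNotReady`, the placeholder URL and the plain `http.Client` are illustrative only; a real cluster needs TLS and credentials, and the provider would more likely go through its existing Kubernetes client.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// errNotReady is a hypothetical sentinel for "the API server answered, but not with 200 OK".
var errNotReady = errors.New("kubernetes API not ready")

// readyz probes the API server's /readyz endpoint and returns nil only on 200 OK.
// baseURL is an assumption; authentication and TLS setup are omitted.
func readyz(ctx context.Context, client *http.Client, baseURL string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/readyz", nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err // connection refused, timeout, etc.
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%w: %s", errNotReady, resp.Status)
	}
	return nil
}

// mayCloseLeases is the gate itself: destructive actions are allowed
// only when the most recent readiness probe succeeded.
func mayCloseLeases(probeErr error) bool {
	return probeErr == nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Placeholder address standing in for an unavailable control plane.
	err := readyz(ctx, http.DefaultClient, "http://127.0.0.1:1")
	fmt.Println("probe error:", err)
	fmt.Println("may close leases:", mayCloseLeases(err))
}
```

The point of the gate is ordering: the readiness check runs before any lease is evaluated, so an outage pauses lease decisions instead of feeding them empty or failed query results.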

I guess the most serious case would be akash-provider receiving wrong information from the control plane / etcd for whatever reason, for example that K8s has no Akash Deployments running at all. But I believe this wasn't the case here.
