control plane node (or etcd) outage causes akash provider to close all the leases #245

andy108369 · 2024-08-09T20:33:14Z

Two providers reported that akash-provider closed ALL of their leases while the control plane node was offline.

7 Aug 2024 - DCNorse provider - control plane node crashed (was down for some time).
9 Aug 2024 - Europlots provider - control plane node was unavailable for 15 minutes during provider migration.

Both have only 1 etcd and 1 control plane in total (etcd is running on the control plane node).

My guess is that one of the following is happening:
A. akash-provider can't reach K8s to query the deployments in Running state, which triggers the monitorMaxRetries counter and closes the leases after 8 minutes.

B. akash-provider can't reach etcd (since it's also running on the control plane node) to query the manifests CRD and decides to remove them.

When akash-provider does not have enough information (from control plane / etcd) about the deployments (either Pods or Deployment / StatefulSet resources, and their state), it should not make critical decisions (such as terminating all the leases). Ideally, it should first verify that K8s/etcd is healthy.

I guess the biggest issue is if akash-provider receives wrong information from the control plane / etcd for whatever reason, such as that K8s has no Akash Deployments running at all. But I believe this wasn't the case.


andy108369 · 2024-12-23T08:52:43Z

Side note: I've opened a discussion about what might be a potential contributor (not the root cause) to this issue: https://github.com/orgs/akash-network/discussions/760
